WebSPHINX

WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes web pages automatically.

WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.

Crawler Workbench

The Crawler Workbench is a graphical user interface that lets you configure and control a customizable web crawler. Using the Crawler Workbench, you can:

  • Visualize a collection of web pages as a graph
  • Save pages to your local disk for offline browsing
  • Concatenate pages for viewing or printing as a single document
  • Extract all text matching a certain pattern from a collection of pages
  • Develop a custom crawler in Java or JavaScript that processes pages however you want
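
To make the pattern-extraction task concrete, here is a minimal sketch in plain Java. It does not use the WebSPHINX API: the class PatternExtractor, the method extractAll, and the sample URLs are hypothetical names chosen for illustration, and the pages are assumed to be already downloaded into a map from URL to text. Matching is done with java.util.regex rather than the library's own pattern classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebSPHINX: collects every regex match
// from a set of pages that have already been fetched.
class PatternExtractor {

    // pages maps a page URL to its text; matches are returned in page order.
    static List<String> extractAll(Map<String, String> pages, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String text : pages.values()) {
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                matches.add(m.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample data standing in for crawled pages (assumed, not real sites).
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "Contact: alice@example.com");
        pages.put("http://example.com/b", "bob@example.org or carol@example.net");
        // Deliberately loose e-mail pattern, good enough for a demonstration.
        System.out.println(extractAll(pages, "[\\w.]+@[\\w.]+"));
    }
}
```

A LinkedHashMap is used so that results come back in the order the pages were added, which roughly mirrors the order in which a crawler would have visited them.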

WebSPHINX class library

The WebSPHINX class library provides support for writing web crawlers in Java. The class library offers a number of features:

  • Multithreaded Web page retrieval in a simple application framework
  • An object model that explicitly represents pages and links
  • Support for reusable page content classifiers
  • Tolerant HTML parsing
  • Support for the robot exclusion standard
  • Pattern matching, including regular expressions, Unix shell wildcards, and HTML tag expressions. Regular expressions are provided by the Apache jakarta-regexp regular expression library.
  • Common HTML transformations, such as concatenating pages, saving pages to disk, and renaming links
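
As an illustration of the Unix shell wildcard matching mentioned in the list above, the following sketch translates a wildcard into a regular expression using only java.util.regex. It is not the WebSPHINX implementation (which relies on jakarta-regexp): the class WildcardMatcher and its methods are hypothetical, and only the `*` and `?` wildcards are handled; bracket expressions such as `[a-z]` are left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebSPHINX: matches text against a
// shell-style wildcard by converting it to a java.util.regex pattern.
class WildcardMatcher {

    // '*' becomes ".*", '?' becomes ".", everything else is quoted literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*':
                    sb.append(".*");
                    break;
                case '?':
                    sb.append('.');
                    break;
                default:
                    sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True only when the wildcard matches the whole text.
    static boolean matches(String wildcard, String text) {
        return toRegex(wildcard).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*.html", "index.html"));     // true
        System.out.println(matches("page?.html", "page1.html")); // true
        System.out.println(matches("*.html", "index.htm"));      // false
    }
}
```

Quoting every literal character keeps regex metacharacters such as the dot in "*.html" from being interpreted, so the wildcard "a.b" matches only the string "a.b" and not "axb".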

URL: http://www.cs.cmu.edu/~rcm/websphinx/
License: Apache License


Copyright 2005 - 2008 www.java-tips.org
Java is a trademark of Sun Microsystems, Inc.