LDSE

This program is the result of my diploma thesis (of the same title). It is a distributed search engine consisting of multiple search nodes that report their results to a master server. Each node is responsible for indexing and querying a local site (ideally a single web server). The nodes are connected in a hierarchical way: every super node can execute a query against its own index and can also query all (or a subset) of its sub-nodes, choosing the sub-nodes that are expected to give the best results for the query.

Every node can also be queried directly, even if it is not the top node; it then answers the query using its own data and the data of its sub-nodes.
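To make the query flow concrete, here is a minimal Java sketch of such a node hierarchy. The class and method names (SearchNode, queryLocalIndex, selectBestSubNodes) are illustrative assumptions, not LDSE's actual API.

// A node queries its own index and forwards the query to the most
// promising sub-nodes, merging all results.
import java.util.ArrayList;
import java.util.List;

public class SearchNode {
    private final List<SearchNode> subNodes = new ArrayList<SearchNode>();

    public void addSubNode(SearchNode node) {
        subNodes.add(node);
    }

    // Any node can be queried directly; it combines its own results
    // with those of its (selected) sub-nodes.
    public List<String> query(String terms) {
        List<String> results = new ArrayList<String>(queryLocalIndex(terms));
        for (SearchNode sub : selectBestSubNodes(terms)) {
            results.addAll(sub.query(terms));
        }
        return results;
    }

    // Placeholder: search this node's local index (e.g. one web server).
    protected List<String> queryLocalIndex(String terms) {
        return new ArrayList<String>();
    }

    // Placeholder: pick the sub-nodes expected to give the best results;
    // here we simply return all of them.
    protected List<SearchNode> selectBestSubNodes(String terms) {
        return subNodes;
    }
}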

Special features:

  • distributed search engine
  • tolerant of spelling errors and other word forms
  • separate data server and data gatherer
  • supports many file formats via a plugin mechanism (a possible interface is sketched after this list); currently supported:
      • PDF
      • HTML
      • plain text
      • ZIP and gzip archives
  • works with any relational database (possibly with small changes due to differences in SQL dialect); tested with InstantDB and Oracle Lite
  • can gather data via HTTP or from the local file system
  • the HTTP spider is resistant to loops (e.g. a document that links to itself under a different path)
  • the HTTP spider is resistant to HTML errors (such as missing quotes in attribute values or unescaped &'s)
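The plugin mechanism for file formats could look roughly like the interface below. This is a hypothetical sketch of such a plugin contract, not LDSE's real plugin API.

import java.io.IOException;
import java.io.InputStream;

// Each format handler turns one document type into indexable plain text.
public interface DocumentHandler {

    // MIME types (or file extensions) this plugin can handle,
    // e.g. "application/pdf" or "text/html".
    String[] supportedTypes();

    // Extract the plain text to be indexed from the raw document data.
    String extractText(InputStream in) throws IOException;
}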

For fault tolerance, a so-called "trigram index" is used. This index takes all trigrams (3-letter combinations) contained in a word and stores this information in a reverse index; a second reverse index maps words to documents. This makes queries fast and tolerant of misspelled words (whether in the indexed documents or in the query), and it also allows finding substrings within words.
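The following sketch illustrates the trigram-index idea in Java. The class layout and names are assumptions for illustration only, not LDSE's implementation.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrigramIndex {
    // trigram -> words containing that trigram
    private final Map<String, Set<String>> trigramToWords = new HashMap<String, Set<String>>();
    // word -> documents containing that word
    private final Map<String, Set<String>> wordToDocuments = new HashMap<String, Set<String>>();

    public void addWord(String word, String documentId) {
        for (String trigram : trigrams(word)) {
            Set<String> words = trigramToWords.get(trigram);
            if (words == null) {
                words = new HashSet<String>();
                trigramToWords.put(trigram, words);
            }
            words.add(word);
        }
        Set<String> docs = wordToDocuments.get(word);
        if (docs == null) {
            docs = new HashSet<String>();
            wordToDocuments.put(word, docs);
        }
        docs.add(documentId);
    }

    // A misspelled or partial query term still shares most of its trigrams
    // with the stored word, so matching words (and their documents) are found.
    public Set<String> lookup(String term) {
        Set<String> documents = new HashSet<String>();
        for (String trigram : trigrams(term)) {
            Set<String> words = trigramToWords.get(trigram);
            if (words == null) {
                continue;
            }
            for (String word : words) {
                Set<String> docs = wordToDocuments.get(word);
                if (docs != null) {
                    documents.addAll(docs);
                }
            }
        }
        return documents;
    }

    // All 3-letter combinations contained in a word.
    private static Set<String> trigrams(String word) {
        Set<String> result = new HashSet<String>();
        String w = word.toLowerCase();
        for (int i = 0; i + 3 <= w.length(); i++) {
            result.add(w.substring(i, i + 3));
        }
        return result;
    }
}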

URL: http://www.hendriklipka.de/java/ldse.html
Licence: GPL

