This program is the result of my diploma thesis of the same title. It is a distributed search engine consisting of multiple search nodes which report their results to a master server. Each node is responsible for indexing and querying its local data source (ideally a single web server). The nodes are connected hierarchically. Every super node can execute a query against its own index and can also query all, or a subset, of its sub-nodes. The subset is chosen by determining which sub-nodes can give the best results for the query.
Every node can be queried directly, even if it is not the top node. It then uses its own data and the data of its sub-nodes to answer the query.
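The node hierarchy described above can be sketched roughly as follows. This is only an illustration: the class name `SearchNodeSketch`, its methods, and in particular the ranking heuristic (counting matching documents in each sub-tree) are assumptions made for this example, since the description does not say how the best sub-nodes are actually determined.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a hierarchical search node; not the program's real classes.
class SearchNodeSketch {
    // Local index of this node: word -> ids of local documents containing it.
    private final Map<String, Set<String>> localIndex = new HashMap<>();
    private final List<SearchNodeSketch> subNodes = new ArrayList<>();

    void addSubNode(SearchNodeSketch node) {
        subNodes.add(node);
    }

    void index(String docId, String... words) {
        for (String w : words) {
            localIndex.computeIfAbsent(w.toLowerCase(Locale.ROOT), k -> new TreeSet<>()).add(docId);
        }
    }

    // Assumed heuristic: number of documents in this sub-tree that contain the word.
    // A real distributed node would more likely keep a cached summary of its sub-nodes.
    int estimate(String word) {
        int n = localIndex.getOrDefault(word, Collections.emptySet()).size();
        for (SearchNodeSketch s : subNodes) {
            n += s.estimate(word);
        }
        return n;
    }

    // Answers from the local index plus the maxSubNodes most promising sub-nodes.
    Set<String> query(String word, int maxSubNodes) {
        String w = word.toLowerCase(Locale.ROOT);
        Set<String> results = new TreeSet<>(localIndex.getOrDefault(w, Collections.emptySet()));
        List<SearchNodeSketch> ranked = new ArrayList<>(subNodes);
        ranked.removeIf(s -> s.estimate(w) == 0); // skip sub-nodes with nothing to offer
        ranked.sort(Comparator.comparingInt((SearchNodeSketch s) -> s.estimate(w)).reversed());
        for (SearchNodeSketch s : ranked.subList(0, Math.min(maxSubNodes, ranked.size()))) {
            results.addAll(s.query(w, maxSubNodes));
        }
        return results;
    }
}
```

Because `query` is an ordinary method on every node, calling it on a node in the middle of the tree returns results from that node and its sub-nodes only, which mirrors the ability to query any node directly.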
Special features are:
- distributed search engine
- tolerant of spelling errors and other word forms
- separated data server and data gatherer
- supports many file formats via a plugin mechanism; currently supported are:
  - PDF
  - HTML
  - plain text
  - ZIP and gzip files
- works with any relational database (small changes may be needed because of differences in the SQL dialect)
  - tested with InstantDB and Oracle Lite
- can gather data via HTTP or from local file system
- HTTP spider is resistant to loops (for example, when a document links to itself under a different path)
- HTTP spider is resistant to HTML errors (such as missing quotation marks around attribute values or unescaped &'s)
For fault tolerance, a so-called "trigram index" is used. This index takes all trigrams (three-letter combinations) contained in a word and stores them in a reverse (inverted) index that maps trigrams to words. A second reverse index maps the words to the documents. This gives fast queries and tolerance of misspelled words, whether in the searched documents or in the query. It can also find substrings within words.
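To make the two-level lookup concrete, here is a minimal in-memory sketch. The class `TrigramIndexSketch`, its methods, and the `minShared` threshold are assumptions for illustration only. The real program stores its index in a relational database, and its matching rule is not described here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory sketch of a trigram index; not the program's real schema.
class TrigramIndexSketch {
    // First reverse index: trigram -> words containing it.
    private final Map<String, Set<String>> trigramToWords = new HashMap<>();
    // Second reverse index: word -> documents containing it.
    private final Map<String, Set<String>> wordToDocs = new HashMap<>();

    // All three-letter substrings of a word; words shorter than three letters yield none.
    static Set<String> trigrams(String word) {
        Set<String> result = new LinkedHashSet<>();
        String w = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= w.length(); i++) {
            result.add(w.substring(i, i + 3));
        }
        return result;
    }

    void add(String docId, String word) {
        String w = word.toLowerCase(Locale.ROOT);
        wordToDocs.computeIfAbsent(w, k -> new TreeSet<>()).add(docId);
        for (String t : trigrams(w)) {
            trigramToWords.computeIfAbsent(t, k -> new TreeSet<>()).add(w);
        }
    }

    // Documents containing any indexed word that shares at least minShared
    // trigrams with the query term (assumed matching rule).
    Set<String> search(String term, int minShared) {
        Map<String, Integer> sharedCount = new HashMap<>();
        for (String t : trigrams(term)) {
            for (String w : trigramToWords.getOrDefault(t, Collections.emptySet())) {
                sharedCount.merge(w, 1, Integer::sum);
            }
        }
        Set<String> docs = new TreeSet<>();
        for (Map.Entry<String, Integer> e : sharedCount.entrySet()) {
            if (e.getValue() >= minShared) {
                docs.addAll(wordToDocs.get(e.getKey()));
            }
        }
        return docs;
    }
}
```

With "distributed" indexed, the misspelled query "distribted" still shares the trigrams dis, ist, str, tri, rib and ted with it, so a threshold of 4 finds the document. The substring query "tribu" matches through tri, rib and ibu.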
URL: http://www.hendriklipka.de/java/ldse.html
Licence: GPL