Search Engine

Egothor is an open-source, high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. It can be configured as a standalone engine, a metasearcher, or a peer-to-peer hub, and it can also be used as a library by an application that needs full-text search.

Key features of Egothor

  • Written in Java for cross-platform compatibility.
  • Able to recognize many of the most familiar file formats: HTML, PDF, PS, and Microsoft's DOC and XLS.
  • High-capacity robot that supports the robots.txt recommendation.
  • Index compression uses Golomb, Elias-gamma, and block coding.
  • Based on the extended Boolean model, which can operate as either the vector model or the Boolean model.
  • Universal stemmer that processes any language.
  • New dynamization algorithm for fast index updating.
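
To illustrate the compression methods named in the list above, here is a minimal Java sketch of Elias-gamma coding applied to the gaps between sorted document ids. It is not taken from Egothor's source: the class name EliasGammaSketch and its methods are hypothetical, and the bits are kept in a String for readability rather than packed into bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Egothor code.
final class EliasGammaSketch {

    // Elias-gamma code of n >= 1: (bit length of n minus 1) zeros,
    // followed by n written in binary.
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String binary = Integer.toBinaryString(n);
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) {
            sb.append('0');
        }
        return sb.append(binary).toString();
    }

    // Encodes strictly increasing, positive document ids as gamma-coded gaps.
    static String encodePostings(int[] sortedDocIds) {
        StringBuilder sb = new StringBuilder();
        int previous = 0;
        for (int id : sortedDocIds) {
            sb.append(encode(id - previous));
            previous = id;
        }
        return sb.toString();
    }

    // Decodes a concatenation of gamma codes back into the encoded numbers.
    // Assumes well-formed input; truncated input throws an index exception.
    static List<Integer> decodeAll(String bits) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < bits.length()) {
            int zeros = 0;
            while (bits.charAt(i) == '0') {
                zeros++;
                i++;
            }
            out.add(Integer.parseInt(bits.substring(i, i + zeros + 1), 2));
            i += zeros + 1;
        }
        return out;
    }
}
```

Small gaps get short codes, which is why this family of codes suits posting lists where frequent terms appear in many nearby documents.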

URL: http://www.egothor.org/
Licence: BSD License

Nutch is a nascent effort to implement an open-source web search engine.

Web search is a basic requirement for internet navigation, yet the number of web search engines is decreasing. Today's oligopoly could soon be a monopoly, with a single company controlling nearly all web search for its commercial gain. That would not be good for users of the internet.

Nutch provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.) All existing major search engines have proprietary ranking formulas, and will not explain why a given page ranks as it does. Additionally, some search engines determine which sites to index based on payments, rather than on the merits of the sites themselves. Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible.

Nutch aims to enable anyone to easily and cost-effectively deploy a world-class web search engine. This is a substantial challenge. To succeed, Nutch software must be able to:

  • fetch several billion pages per month
  • maintain an index of these pages
  • search that index up to 1000 times per second
  • provide very high quality search results
  • operate at minimal cost

URL: http://lucene.apache.org/nutch/
Licence: Apache License

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

URL: http://lucene.apache.org/java/docs/index.html
Licence: Apache License

BDDBot is a web robot, search engine, and web server written entirely in Java(TM). It was written by Tim Macinta for his book (co-authored with Wes Sonnenreich), Web Developer's Guide to Search Engines, published by Wiley. It was written as an example for a chapter on how to write your own search engine, and as such it is very simplistic. While not as heavy-duty as other free search engines such as ht://Dig, the BDDBot offers the following advantages:

  • Its simplicity makes it a good learning tool for how search engines work. The aforementioned book provides a good top-level overview of how it works so please go buy the book (insert goofy smiley face emoticon here).
  • Its simplicity also makes it easily expandable. You can very easily expand it so that it can index document types besides HTML and plain text. You can also very easily expand it so that it can crawl using different protocols (e.g., gopher, wais) by using the standard Java method for adding protocols.
  • It comes with its own built in web server - we don't know of any other free search engine out there that does this. If you do, please let us know.
  • It's completely free, à la the GNU General Public License. ht://Dig is the only other free search engine we know of that's under the GPL.
  • It's written in Java, which provides several advantages in and of itself. Because it's written in Java:
      • The BDDBot can run on any machine that has a stable Java Virtual Machine (at least as long as Microsoft continues to fail at making Java a Windows-specific language).
      • It is in an easy-to-understand and powerful language.
      • It is object-oriented for even greater extensibility.
  • It's very small - just over 100K including source code, binaries, and configuration files at last count.
  • Its indexes are very small. They are on the order of 10% of the size of the text on your site even though they index every single alphanumeric word.
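
To make the idea of indexing every alphanumeric word concrete, here is a minimal Java sketch of an in-memory inverted index. It is not BDDBot's actual code or on-disk format: the class TinyInvertedIndex and its methods are hypothetical, and the tokenizer is an assumption that treats only ASCII letters and digits as word characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not BDDBot code.
final class TinyInvertedIndex {

    // Maps each lowercased word to the sorted ids of documents containing it.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the text on anything that is not an ASCII letter or digit
    // and records the document id under every resulting word.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of documents containing the word, or an empty set.
    SortedSet<Integer> search(String word) {
        return postings.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }
}
```

For example, after add(1, "Java search engines") and add(2, "A search engine in Java!"), a search for "java" returns both document ids, while "engine" matches only document 2 because no stemming is applied.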

Please keep in mind that the BDDBot was written in about half a week, and that is why it's quite simplistic in most places.

URL: http://www.twmacinta.com/bddbot/
Licence: GPL

Zilverline is what you could call a 'Reverse Search Engine': Zilverline is a search engine that offers web access to your personal or intranet content.

Zilverline is a 'Lucene Desktop' comparable to Google Desktop, but based on Lucene.

Zilverline supports collections: a set of files and directories in a directory. Zilverline extracts content from PDF, Word, Excel, PowerPoint, RTF, txt, java, and CHM files, as well as from zip, rar, and many other archives. A collection can be indexed and searched. The results of the search can be retrieved from local disk or the intranet. Files inside zip, rar, chm, and other archives are extracted during indexing and can be preserved for searches. Otherwise they are extracted 'on-the-fly'.

You can store indexes and caches wherever you like, for instance on a DVD, as long as Zilverline (and your webserver) can access them.

Indexes can be built incrementally or from scratch. An incremental run picks up new files but does not remove files from the index. The indexing process can be scheduled to run automatically (see scheduling), or it can be triggered through a SOAP web service.

If you supply a URL with your collection, the search results will map to the URL instead of the source. This allows you to retrieve your result hits from your webserver instead of from disk.
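
The mapping described above can be pictured with a small Java sketch. It is not Zilverline's implementation: the class CollectionUrlMapper, its method, and the example host are hypothetical, and the sketch assumes the hit lies under the collection directory and does not percent-encode special characters.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not Zilverline code.
final class CollectionUrlMapper {

    // Rewrites a file under collectionRoot as a URL under baseUrl,
    // keeping the relative directory structure.
    static String toUrl(Path collectionRoot, String baseUrl, Path file) {
        if (!file.startsWith(collectionRoot)) {
            throw new IllegalArgumentException(file + " is outside " + collectionRoot);
        }
        Path relative = collectionRoot.relativize(file);
        StringBuilder url = new StringBuilder(baseUrl);
        if (!baseUrl.endsWith("/")) {
            url.append('/');
        }
        // Join path segments with '/' regardless of the platform separator.
        for (int i = 0; i < relative.getNameCount(); i++) {
            if (i > 0) {
                url.append('/');
            }
            url.append(relative.getName(i));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUrl(
                Paths.get("/data/docs"),
                "http://intranet.example/docs",
                Paths.get("/data/docs/manuals/guide.pdf")));
    }
}
```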

Zilverline is internationalized (English, Chinese, French, German, Spanish, Brazilian and Dutch), and has support for skins (now three).

Zilverline is built in Java on top of Lucene and Spring. You need a servlet engine, such as Tomcat, to run it. I'm using Tomcat 5.0.28.

URL: http://www.zilverline.org/
Licence: Proprietary

The YaCy project is a new approach to building a P2P-based web indexing network.

  • Search your own or the global index
  • Crawl your own pages or start distributed crawling
  • Run your peer to support other YaCy crawlers
  • Provide information on your peer using the built-in HTTP server, file-sharing zone, and wiki
  • Built-in caching http proxy
  • Indexing benefits from the proxy cache; private information is not stored or indexed
  • Using the proxy is not required for web indexing, but it lets you access the new top-level domain '.yacy'
  • Filter unwanted content like ad- or spyware; share your web-blacklist with other peers
  • Easy installation! No additional database required!
  • No central server!
  • GPL'ed, freeware

URL: http://www.yacy.net/yacy/
Licence: GPL