Thursday, July 12, 2007

Unweaving a Tangled Web With HTMLParser and Lucene


Ever wanted to write a Java program that crawls the web? You
know, a program that reads HTML pages, extracts the links,
fetches the pages they point to, finds more links, and so on.
Maybe you have also thought about storing the text from those
pages for later use, for example to search them for specific
information. These are the characteristics of a search engine
like Google or Yahoo. If you have a web site of your own, you
might be interested in having your own search engine. One
possibility is to buy one or to use an Open Source search
engine, but you might also find it rewarding to write your own!


In this article I'll show you the basic technique for building a search engine
using two powerful Open Source products: HTMLParser and Lucene.
