Thursday, April 19, 2007

Nutch Introduction

Introduction to Nutch, Part 2: Searching
In part one
of this two-part series on Nutch, the
open-source Java search engine, we looked at how to crawl websites.
Recall that the Nutch crawler system produces three key data
structures:

  • The WebDB containing the web graph of pages and links.
  • A set of segments containing the raw data retrieved from
    the Web by the fetchers.
  • The merged index created by indexing and de-duplicating
    parsed data from the segments.

In this article, we turn to searching. The Nutch search
system uses the index and segments generated during the crawling
process to answer users' search queries. We shall see how to get
the Nutch search application up and running, and how to customize
and extend it for integration into an existing website. We'll also
look at how to re-crawl sites to keep your index up to date--a
requirement of all real-world search engines.

