Introduction to Nutch, Part 1: Crawling


http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

Nutch is an open source Java implementation of a search engine. It provides all of the tools you need to run your own search engine. But why would anyone want to run their own search engine? After all, there's always Google. There are at least three reasons.

  1. Transparency. Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
  2. Understanding. We don't have the source code to Google, so Nutch is probably the best we have. It's interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
  3. Extensibility. Don't like the way other search engines display their results? Write your own search engine--using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.

Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In "Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:

... a complete system might cost anywhere between $800 per month for two-search-per-second performance over 100 million pages, to $30,000 per month for 50-page-per-second performance over 1 billion pages.

This series of two articles shows you how to use Nutch at the more modest intranet scale (note that you may see this term being used to cover sites that are actually on the public internet--the point is the size of the crawl being undertaken, which ranges from a single site to tens, or possibly hundreds, of sites). This first article concentrates on crawling: the architecture of the Nutch crawler, how to run a crawl, and understanding what it generates. The second looks at searching, and shows you how to run the Nutch search application, ways to customize it, and considerations for running a real-world system.

Nutch vs. Lucene

Nutch is built on top of Lucene, which is an API for text indexing and searching. A common question is: "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you don't need a web crawler. A common scenario is that you have a web front end to a database that you want to make searchable. The best way to do this is to index the data directly from the database using the Lucene API, and then write code to do searches against the index, again using Lucene. Erik Hatcher and Otis Gospodnetić's Lucene in Action gives all of the details. Nutch is a better fit for sites where you don't have direct access to the underlying data, or it comes from disparate sources.
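To make the Lucene route concrete, the sketch below indexes rows from a database directly with the Lucene API. The JDBC URL, driver, table, and column names are invented for illustration, and the IndexWriter and Field constructors shown are those of the Lucene 1.9/2.x API, so check them against the Lucene version you are actually using.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ProductIndexer {
    public static void main(String[] args) throws Exception {
        // Create (or overwrite) a Lucene index in the product-index directory.
        IndexWriter writer =
            new IndexWriter("product-index", new StandardAnalyzer(), true);

        // Hypothetical database and schema -- substitute your own.
        Class.forName("org.hsqldb.jdbcDriver");
        Connection conn =
            DriverManager.getConnection("jdbc:hsqldb:mem:catalog", "sa", "");
        ResultSet rs = conn.createStatement()
            .executeQuery("SELECT id, title, description FROM products");

        while (rs.next()) {
            Document doc = new Document();
            // Store the id untokenized so search results can link back to the row.
            doc.add(new Field("id", rs.getString("id"),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("title", rs.getString("title"),
                              Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("description", rs.getString("description"),
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
        conn.close();
    }
}

Searching the resulting index is then a matter of opening a Lucene IndexSearcher over the same directory; Lucene in Action covers both halves in detail.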

Architecture

Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users' search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments described below in order to produce page summaries and to provide access to cached pages.)

The main practical spin-off from this design is that the crawler and searcher systems can be scaled independently on separate hardware platforms. For instance, a highly trafficked search page that provides searching for a relatively modest set of sites may only need a correspondingly modest investment in the crawler infrastructure, while requiring more substantial resources for supporting the searcher.

We will look at the Nutch crawler here, and leave discussion of the searcher to part two.

The Crawler

The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures: the web database, a set of segments, and the index. We describe each of these in more detail next.

The web database, or WebDB, is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links. A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including the number of links in the page (also called outlinks); fetch information (such as when the page is due to be refetched); and the page's score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages). A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.
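To make the two record types concrete, here is a simplified sketch, in plain Java, of what a WebDB page and link carry. These are not the actual Nutch classes; the field names simply mirror the output of the readdb tool shown later in this article.

// Illustrative only: a simplified view of the two WebDB record types.
public class WebDbSketch {

    /** One entry per unique URL. */
    static class Page {
        String url;            // the page's URL (one key)
        byte[] contentMd5;     // MD5 hash of the page contents (the other key)
        int numOutlinks;       // how many links the page contains
        long nextFetchTime;    // when the page is due to be refetched
        int retriesSinceFetch; // failed fetch attempts so far
        float score;           // importance, e.g. raised by many inbound links
    }

    /** A directed edge in the web graph: source page -> target URL. */
    static class Link {
        byte[] sourceMd5;      // MD5 of the page containing the link
        String targetUrl;      // the URL the link points to
        String anchorText;     // the text of the <a> element
    }
}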

A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it's easy to tell how old they are.

The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes. Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.

The crawl tool

Now that we have some terminology, it is worth trying to understand the crawl tool, since it does a lot behind the scenes. Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB, a set of fetchers downloads the content from the Web, the crawler updates the WebDB with new links that were found, and then the crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle) and the cycle repeats. This cycle is often referred to as the generate/fetch/update cycle, and runs periodically as long as you want to keep your search index up to date.

URLs with the same host are always assigned to the same fetchlist. This is done for reasons of politeness, so that a web site is not overloaded with requests from multiple fetchers in rapid succession. Nutch observes the Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.
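A simple way to achieve this is to assign each URL to a fetchlist based on a hash of its host name, so that every URL from a given host lands in the same list. The snippet below illustrates the idea only; it is not Nutch's actual partitioning code.

import java.net.MalformedURLException;
import java.net.URL;

// Illustration of host-based fetchlist assignment: all URLs with the same
// host hash to the same fetchlist, so a single fetcher handles that site.
public class FetchlistPartitioner {

    static int fetchlistFor(String url, int numFetchlists)
            throws MalformedURLException {
        String host = new URL(url).getHost().toLowerCase();
        // Mask the sign bit so the index is always non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numFetchlists;
    }

    public static void main(String[] args) throws Exception {
        int n = 4;
        // Both pages on "keaton" map to the same fetchlist number.
        System.out.println(fetchlistFor("http://keaton/tinysite/A.html", n));
        System.out.println(fetchlistFor("http://keaton/tinysite/B.html", n));
    }
}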

The crawl tool is actually a front end to other, lower-level tools, so it is possible to get the same results by running the lower-level tools in a particular sequence. Here is a breakdown of what crawl does, with the lower-level tool names in parentheses:

  1. Create a new WebDB (admin db -create).
  2. Inject root URLs into the WebDB (inject).
  3. Generate a fetchlist from the WebDB in a new segment (generate).
  4. Fetch content from URLs in the fetchlist (fetch).
  5. Update the WebDB with links from fetched pages (updatedb).
  6. Repeat steps 3-5 until the required depth is reached.
  7. Update segments with scores and links from the WebDB (updatesegs).
  8. Index the fetched pages (index).
  9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
  10. Merge the indexes into a single index for searching (merge).

After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle has finished, the crawler goes on to create an index from all of the segments (steps 7-10). Each segment is indexed independently (step 8), before duplicate pages (that is, pages at different URLs with the same content) are removed (step 9). Finally, the individual indexes are combined into a single index (step 10).
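Expressed as code, the control flow of the crawl tool looks roughly like the sketch below. Every method here is an invented placeholder standing in for one of the lower-level tools listed above, so treat it as a schematic of the sequence rather than the actual crawl driver.

// Schematic only: each call stands for one of the lower-level tools
// (admin db -create, inject, generate, fetch, updatedb, updatesegs,
// index, dedup, merge). It is not the real Nutch crawl code.
public class CrawlSketch {

    public static void crawl(String urlsFile, String dir, int depth) {
        createWebDb(dir);                     // 1. admin db -create
        inject(dir, urlsFile);                // 2. inject root URLs

        for (int i = 0; i < depth; i++) {     // 3-6. generate/fetch/update
            String segment = generate(dir);   //     new segment + fetchlist
            fetch(segment);                   //     download the pages
            updateDb(dir, segment);           //     add newly found links
        }

        updateSegments(dir);                  // 7. updatesegs
        indexSegments(dir);                   // 8. index each segment
        dedup(dir);                           // 9. remove duplicate content
        merge(dir);                           // 10. one index for searching
    }

    // Placeholders so the sketch compiles; each would invoke the
    // corresponding Nutch tool.
    static void createWebDb(String dir) {}
    static void inject(String dir, String urls) {}
    static String generate(String dir) { return dir + "/segments/<timestamp>"; }
    static void fetch(String segment) {}
    static void updateDb(String dir, String segment) {}
    static void updateSegments(String dir) {}
    static void indexSegments(String dir) {}
    static void dedup(String dir) {}
    static void merge(String dir) {}
}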

The dedup tool also removes duplicate URLs from the segment indexes. This is not because a URL can be duplicated in the WebDB--that cannot happen, since the WebDB does not allow duplicate URL entries. Instead, duplicate URLs arise when a URL is re-fetched while the segment from the previous fetch still exists (because it hasn't been deleted). This situation cannot occur during a single run of the crawl tool, but it can during re-crawls, which is why dedup removes duplicate URLs as well as duplicate content.

While the crawl tool is a great way to get started with crawling websites, you will need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl. We shall see how to do this in the real-world example later, in part two of this series. Also, crawl is really aimed at intranet-scale crawling. To do a whole web crawl, you should start with the lower-level tools. (See the "Resources" section for more information.)

Configuration and Customization

All of Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. The main configuration file is conf/nutch-default.xml. As the name suggests, it contains the default settings, and should not be modified. To change a setting you create conf/nutch-site.xml, and add your site-specific overrides.

Nutch defines various extension points, which allow developers to customize Nutch's behavior by writing plugins, found in the plugins subdirectory. Nutch's parsing and indexing functionality is implemented almost entirely by plugins--it is not in the core code. For instance, the code for parsing HTML is provided by the HTML document parsing plugin, parse-html. You can control which plugins are available to Nutch with the plugin.includes and plugin.excludes properties in the main configuration file.
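For example, a conf/nutch-site.xml that overrides the set of enabled plugins might look like the snippet below. The plugin.includes property name comes from the main configuration file, but the value here is purely illustrative: start from the default value in conf/nutch-default.xml, and copy that file's root element and property layout exactly if your version differs from what is shown.

<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml.
     The plugin list below is illustrative; base yours on the default. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>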

With this background, let's run a crawl on a toy site to get a feel for what the Nutch crawler does.

Running a Crawl

First, download the latest Nutch distribution and unpack it on your system (I used version 0.7.1). To use the Nutch tools, you will need to make sure the NUTCH_JAVA_HOME or JAVA_HOME environment variable is set to tell Nutch where Java is installed.

I created a contrived example with just four pages to understand the steps involved in the crawl process. Figure 1 illustrates the links between pages. C and C-dup (C-duplicate) have identical content.

Figure 1
Figure 1. The site structure for the site we are going to crawl

Before we run the crawler, create a file called urls that contains the root URLs from which to populate the initial fetchlist. In this case, we'll start from page A.

echo 'http://keaton/tinysite/A.html' > urls

The crawl tool uses a filter to decide which URLs go into the WebDB (in steps 2 and 5 in the breakdown of crawl above). The filter is a list of regular-expression rules, so it can be used to restrict the crawl to URLs that match any given pattern; a toy sketch of how such rules are applied follows the configuration change below. Here, we just restrict the domain to the server on my intranet (keaton), by changing the line in the configuration file conf/crawl-urlfilter.txt from

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

to

+^http://keaton/
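To see roughly how these +/- rules behave, here is a toy re-implementation of the filtering logic in Java. It is for illustration only; the real work is done by Nutch's URL filter plugin, whose behavior may differ in detail.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// A toy version of +/- regex URL filtering: rules are tried in order and
// the first matching rule decides whether the URL is kept.
public class ToyUrlFilter {
    private final List<Pattern> patterns = new ArrayList<Pattern>();
    private final List<Boolean> accepts = new ArrayList<Boolean>();

    void addRule(String rule) {
        accepts.add(rule.charAt(0) == '+');
        patterns.add(Pattern.compile(rule.substring(1)));
    }

    boolean accept(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return accepts.get(i);
            }
        }
        return false; // no rule matched: drop the URL
    }

    public static void main(String[] args) {
        ToyUrlFilter filter = new ToyUrlFilter();
        filter.addRule("+^http://keaton/");
        filter.addRule("-.");  // skip everything else
        System.out.println(filter.accept("http://keaton/tinysite/B.html"));          // true
        System.out.println(filter.accept("http://en.wikipedia.org/wiki/Alligator")); // false
    }
}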

Now we are ready to crawl, which we do with a single command:

bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log

The crawl uses the root URLs in urls to start the crawl, and puts the results of the crawl in the directory crawl-tinysite. The crawler logs its activity to crawl.log. The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage. Three is enough to reach all of the pages in this example, but for real sites it is best to start with five (the default), and increase it if you find some pages aren't being reached.

We shall now look in some detail at the data structures crawl has produced.

Examining the Results of the Crawl

If we peek into the crawl-tinysite directory, we find three subdirectories: db, segments, and index (see Figure 2). These contain the WebDB, the segments, and the Lucene index, respectively.

Figure 2
Figure 2. The directories and files created after running the crawl tool

Nutch comes with several tools for examining the data structures it builds, so let's use them to see what the crawl has created.

WebDB

The first thing to look at is the number of pages and links in the database. This is useful as a sanity check to give us some confidence that the crawler did indeed crawl the site, and how much of it. The readdb tool parses the WebDB and displays portions of it in human-readable form. We use the -stats option here:

bin/nutch readdb crawl-tinysite/db -stats

which displays:

Number of pages: 4
Number of links: 4

As expected, there are four pages in the WebDB (A, B, C, and C-duplicate) and four links between them. The links to Wikipedia are not in the WebDB, since they did not match the pattern in the URL filter file. Both C and C-duplicate are present because the WebDB de-duplicates pages by URL only, not by content (URL uniqueness is also why A, which is both a root URL and a link target, doesn't appear twice). Next, we can dump all of the pages, by using a different option for readdb:

bin/nutch readdb crawl-tinysite/db -dumppageurl

which gives:

Page 1: Version: 4
URL: http://keaton/tinysite/A.html
ID: fb8b9f0792e449cda72a9670b4ce833a
Next fetch: Thu Nov 24 11:13:35 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 1
Score: 1.0
NextScore: 1.0

Page 2: Version: 4
URL: http://keaton/tinysite/B.html
ID: 404db2bd139307b0e1b696d3a1a772b4
Next fetch: Thu Nov 24 11:13:37 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 3
Score: 1.0
NextScore: 1.0

Page 3: Version: 4
URL: http://keaton/tinysite/C-duplicate.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:39 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

Page 4: Version: 4
URL: http://keaton/tinysite/C.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:40 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

Each page appears in a separate block, with one field per line. The ID field is the MD5 hash of the page contents: note that C and C-duplicate have the same ID. There is also information about when the pages should be next fetched (which defaults to 30 days), and page scores. It is easy to dump the structure of the web graph, too:

bin/nutch readdb crawl-tinysite/db -dumplinks

which produces:

from http://keaton/tinysite/B.html
 to http://keaton/tinysite/A.html
 to http://keaton/tinysite/C-duplicate.html
 to http://keaton/tinysite/C.html

from http://keaton/tinysite/A.html
 to http://keaton/tinysite/B.html

For sites larger than a few pages, it is less useful to dump the WebDB in full using these verbose formats. The readdb tool also supports extraction of an individual page or link by URL or MD5 hash. For example, to examine the links to page B, issue the command:

bin/nutch readdb crawl-tinysite/db -linkurl http://keaton/tinysite/B.html

to get:

Found 1 links.
Link 0: Version: 5
ID: fb8b9f0792e449cda72a9670b4ce833a
DomainID: 3625484895915226548
URL: http://keaton/tinysite/B.html
AnchorText: B
targetHasOutlink: true

Notice that the ID is the MD5 hash of the source page A.

There are other ways to inspect the WebDB. The admin tool can produce a dump of the whole database in plain-text tabular form, with one entry per line, using the -textdump option. This format is handy for processing with scripts. The most flexible way of reading the WebDB is through the Java interface. See the Nutch source code and API documentation for more details. A good starting point is org.apache.nutch.db.WebDBReader, which is the Java class that implements the functionality of the readdb tool (readdb is actually just a synonym for org.apache.nutch.db.WebDBReader).

Segments

The crawl created three segments in timestamped subdirectories in the segments directory, one for each generate/fetch/update cycle. The segread tool gives a useful summary of all of the segments:

bin/nutch segread -list -dir crawl-tinysite/segments/

giving the following tabular output (slightly reformatted to fit this page):

PARSED? STARTED           FINISHED          COUNT DIR NAME
true    20051025-12:13:35 20051025-12:13:35 1     crawl-tinysite/segments/20051025121334
true    20051025-12:13:37 20051025-12:13:37 1     crawl-tinysite/segments/20051025121337
true    20051025-12:13:39 20051025-12:13:39 2     crawl-tinysite/segments/20051025121339

TOTAL: 4 entries in 3 segments.

The PARSED? column is always true when the crawl tool is used; it becomes useful when the fetcher is run with parsing turned off, so that parsing can be done later as a separate process. The STARTED and FINISHED columns indicate the times when fetching started and finished. This information is invaluable for bigger crawls, when tracking down why crawling is taking a long time. The COUNT column shows the number of fetched pages in the segment. The last segment, for example, has two entries, corresponding to pages C and C-duplicate.

Sometimes it is necessary to find out in more detail what is in a particular segment. This is done using the -dump option for segread. Here we dump the first segment (again, slightly reformatted to fit this page):

s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s
Recno:: 0
FetcherOutput::
FetchListEntry: version: 2
fetch: true
page: Version: 4
URL: http://keaton/tinysite/A.html
ID: 6cf980375ed1312a0ef1d77fd1760a3e
Next fetch: Tue Nov 01 11:13:34 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
anchors: 1
anchor: A
Fetch Result:
MD5Hash: fb8b9f0792e449cda72a9670b4ce833a
ProtocolStatus: success(1), lastModified=0
FetchDate: Tue Oct 25 12:13:35 BST 2005

Content::
url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1,
Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000",
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
Content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>'A' is for Alligator</title>
</head>
<body>
<p>Alligators live in freshwater environments such as ponds,
marshes, rivers and swamps. Although alligators have
heavy bodies and slow metabolisms, they are capable of
short bursts of speed that can exceed 30 miles per hour.
Alligators' main prey are smaller animals that they can kill
and eat with a single bite. Alligators may kill larger prey
by grabbing it and dragging it in the water to drown.
Food items that can't be eaten in one bite are either allowed
to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From
<a href="http://en.wikipedia.org/wiki/Alligator">the
Wikipedia entry for Alligator</a>.)
</p>
<p><a href="B.html">B</a></p>
</body>
</html>

ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
Outlinks: 2
outlink: toUrl: http://en.wikipedia.org/wiki/Alligator
anchor: the Wikipedia entry for Alligator
outlink: toUrl: http://keaton/tinysite/B.html anchor: B
Metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT,
CharEncodingForConversion=windows-1252, Server=Apache-Coyote/1.1,
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, ETag=W/"1106-1130238131000",
Content-Type=text/html, Connection=close, Content-Length=1106}

ParseText::
'A' is for Alligator Alligators live in freshwater environments such
as ponds, marshes, rivers and swamps. Although alligators have heavy
bodies and slow metabolisms, they are capable of short bursts of
speed that can exceed 30 miles per hour. Alligators' main prey are
smaller animals that they can kill and eat with a single bite.
Alligators may kill larger prey by grabbing it and dragging it in
the water to drown. Food items that can't be eaten in one bite are
either allowed to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From the Wikipedia entry for Alligator .) B

There's a lot of data for each entry--remember this is just a single entry, for page A--but it breaks down into the following categories: fetch data, raw content, and parsed content. The fetch data, indicated by the FetcherOutput section, is data gathered by the fetcher to be propagated back to the WebDB during the update part of the generate/fetch/update cycle.

The raw content, indicated by the Content section, contains the page contents as retrieved by the fetcher, including HTTP headers and other metadata. (By default, the protocol-httpclient plugin is used to do this work.) This content is returned when you ask Nutch search for a cached copy of the page. You can see the HTML page for page A in this example.

Finally, the raw content is parsed using an appropriate parser plugin--determined by looking at the content type and then the file extension. In this case, parse-html was used, since the content type is text/html. The parsed content (indicated by the ParseData and ParseText sections) is used by the indexer to create the segment index.

Index

The tool of choice for examining Lucene indexes is Luke. Luke allows you to look at individual documents in an index, as well as perform ad hoc queries. Figure 3 shows the merged index for our example, found in the index directory.

Figure 3
Figure 3. Browsing the merged index in Luke

Recall that the merged index is created by combining all of the segment indexes after duplicate pages have been removed. In fact, if you use Luke to browse the index for the last segment (found in the index subdirectory of the segment) you will see that page C-duplicate has been removed from the index. Hence, the merged index only has three documents, corresponding to pages A, B, and C.

Figure 3 shows the fields for page A. Most are self-explanatory, but the boost field deserves a mention. It is calculated on the basis of the number of pages linking to this page--the more pages that link to the page, the higher the boost. The boost is not proportional to the number of inbound links; instead, it is damped logarithmically. The formula used is ln(e + n), where n is the number of inbound links. In our example, only page B links to page A, so there is only one inbound link, and the boost works out as ln(e + 1) = 1.3132616 ...

You might be wondering how the boost field is related to the page score that is stored in the WebDB and the segment fetcher output. The boost field is actually calculated by multiplying the page score by the formula in the previous paragraph. For our crawl--indeed, for all crawls performed using the crawl tool--the page scores are always 1.0, so the boosts depend simply on the number of inbound links.
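As a quick sanity check of that arithmetic, the snippet below reimplements the formula just described; it is an illustration, not the Nutch scoring code.

public class BoostCheck {

    // boost = pageScore * ln(e + inboundLinks), as described above.
    static double boost(double pageScore, int inboundLinks) {
        return pageScore * Math.log(Math.E + inboundLinks);
    }

    public static void main(String[] args) {
        // Page A: score 1.0, one inbound link (from B).
        System.out.println(boost(1.0, 1)); // prints 1.3132616875182228
    }
}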

When are page scores not 1.0? Nutch comes with a tool for performing link analysis, LinkAnalysisTool, which uses an algorithm like Google's PageRank to assign a score to each page based on how many pages link to it, weighted by the scores of those linking pages. Notice that this is a recursive definition, which is why link analysis is expensive to compute. Luckily, intranet search usually works fine without link analysis, which is why it is not a part of the crawl tool; it is, however, a key part of whole-web search--indeed, PageRank was crucial to Google's success.

Conclusion

In this article, we looked at the Nutch crawler in some detail. The second article will show how to get the Nutch search application running against the results of a crawl.

Resources

  • The Nutch project page is the place to start for more information on Nutch. The mailing lists for nutch-user and nutch-dev are worth searching if you have a question.
  • At the time of this writing, the Map Reduce version of Nutch is in the main trunk and is not in a released version. This means that you need to build it yourself if you want to use it (or wait for version 0.8 to be released).
  • For more on whole-web crawling, see the Nutch tutorial.
  • For more information on Nutch plugins (which are based on the Eclipse 2.0 plugin architecture), a good starting point is PluginCentral on the Nutch Wiki.
  • Creative Commons provides a Nutch-powered search option for finding Creative-Commons-licensed content (see also this blog entry).
  • "Building Nutch: Open Source Search" (ACM Queue, vol. 2, no. 2, April 2004), by Mike Cafarella and Doug Cutting, is a good high-level introduction to Nutch.
  • "Nutch: A Flexible and Scalable Open Source Web Search Engine" (PDF) (CommerceNet Labs Technical Report 04-04, November 2004), by Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin, covers the filesystem scale of Nutch particularly well.