Namazu Tips


Fast indexing

Saving memory for indexing

Indexing takes a lot of memory. If mknmz fails at run time with an "Out of memory!" error, consider the following precautions.

Score weighting by HTML elements

By default, the following rules are applied for score weighting. These values were chosen empirically and have no theoretical foundation.

In addition, each keyword listed in a meta element, such as foo and bar in <meta name="keywords" content="foo bar">, is given a score of 32.

HTML processing

Namazu decodes &quot;, &amp;, &lt;, and &gt;, as well as named and numbered entities in the ranges &#9-10 and &#32-126. Since the internal encoding is EUC-JP, the right half of ISO-8859-1 (0x80-0xff) cannot be used. For the same reason, numbered entities that refer to UCS-4 code points cannot be used.
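The decoding rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Namazu's own code: the function name decode_entities and the choice to leave unsupported entities untouched are assumptions, and only the four named entities listed in the text are handled.

```python
import re

# Only the four named entities mentioned in the text are handled here.
NAMED = {"quot": '"', "amp": "&", "lt": "<", "gt": ">"}


def _in_supported_range(code):
    # Ranges quoted in the text: 9-10 and 32-126.
    return 9 <= code <= 10 or 32 <= code <= 126


def decode_entities(text):
    """Decode supported entities; leave everything else as written (assumption)."""

    def replace(match):
        name, number = match.group(1), match.group(2)
        if name is not None:
            return NAMED.get(name, match.group(0))
        code = int(number)
        return chr(code) if _in_supported_range(code) else match.group(0)

    # A single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    return re.sub(r"&(?:([A-Za-z]+)|#([0-9]+));", replace, text)
```

For example, decode_entities("&lt;b&gt; &amp; &#65;") gives "<b> & A", while "&#233;" (outside the supported ranges) is returned unchanged.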

Line Adjustment

Spaces and tabs at the beginning and end of each line are removed, as are the characters > | # : at the beginning of a line. If a line ends with a Japanese character, the newline is ignored. (This prevents a Japanese word from being split at the end of a line.) This processing is particularly effective for Mail files. In addition, English words hyphenated across a line break are rejoined.
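A minimal Python sketch of these line adjustments is given below. It is not Namazu's implementation. The function name adjust_lines, the Unicode ranges treated as "Japanese", and the rule used to detect a hyphenated line ending (a letter followed by a hyphen at the end of the line) are all assumptions made for illustration.

```python
import re


def _is_japanese(ch):
    # Assumption: hiragana, katakana, and CJK unified ideographs count as Japanese.
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"


def adjust_lines(text):
    pieces = []
    for raw in text.splitlines():
        # Remove leading spaces, tabs, and the markers > | # : , then trailing blanks.
        line = re.sub(r"^[>|#: \t]+", "", raw).rstrip(" \t")
        if line and _is_japanese(line[-1]):
            pieces.append(line)          # newline ignored: join with the next line
        elif re.search(r"[A-Za-z]-$", line):
            pieces.append(line[:-1])     # drop the hyphen and join with the next line
        else:
            pieces.append(line + "\n")
    return "".join(pieces)
```

With this sketch, the quoted mail lines "> foo bar" and "  | baz  " become "foo bar" and "baz", and "infor-" followed by "mation is" becomes "information is".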

HTML documents digest

HTML defines the structure of documents. A simple digest can be made from the heading information defined by <h[1-6]>. By default, the length of the digest is set to 200 characters. If the headings do not supply enough words, more words are taken from the beginning of the document. If the target is a plain text file, the first 200 characters of the document are simply used.
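The digest rule can be sketched with Python's standard html.parser module. This is an illustration only: the names html_digest and text_digest are hypothetical, and the sketch assumes that the supplementary words are the non-heading text in document order. It also does not filter script or style content.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []  # words found inside <h1>..<h6>
        self.body = []      # all other words, in document order
        self._depth = 0     # greater than 0 while inside a heading element

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        words = data.split()
        if words:
            (self.headings if self._depth else self.body).extend(words)


def html_digest(html, limit=200):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    digest = " ".join(parser.headings)
    if len(digest) < limit:
        # Headings alone are too short: supplement from the start of the document.
        digest = " ".join(parser.headings + parser.body)
    return digest[:limit]


def text_digest(text, limit=200):
    # Plain text: simply the first `limit` characters (whitespace normalized here).
    return " ".join(text.split())[:limit]
```

The limit parameter defaults to the 200 characters mentioned above; everything else about the selection order is a simplification.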

Mail/News Digest

When dealing with Mail/News files, quotation indicators such as "foo@bar.jp wrote:" and quoted text beginning with, for example, > are not included in the Mail/News digest. Note that this text is excluded only from the digest; it is still included in the search target.
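As a rough Python sketch of this filtering, the function below builds a digest from a message body while skipping quoted material. The patterns are assumptions for illustration (an attribution line is taken to be any line ending in "wrote:", and a quoted line is any line starting with >), as is the reuse of the 200-character limit; Namazu's actual rules may differ.

```python
import re

# Assumption: an attribution line is one that ends with "wrote:".
ATTRIBUTION = re.compile(r"wrote:\s*$")
# Assumption: quoted text starts with ">" after optional whitespace.
QUOTED = re.compile(r"^\s*>")


def mail_digest(body, limit=200):
    kept = [
        line.strip()
        for line in body.splitlines()
        if line.strip()
        and not QUOTED.match(line)
        and not ATTRIBUTION.search(line)
    ]
    # Only the digest is filtered; the full body would still be indexed for search.
    return " ".join(kept)[:limit]
```

For a body consisting of "foo@bar.jp wrote:", two lines starting with >, and then "I agree." and "See you.", the digest is "I agree. See you.".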

Symbol handling

Symbol handling is a rather difficult task. Consider the sentence (foo is bar.) . If we split it on spaces, "(foo", "is", and "bar.)" are indexed, and neither foo nor bar can be found by searching.

The easiest solution to this problem is to remove all symbols. However, we sometimes wish to search for words that contain symbols, such as .emacs or TCP/IP . For a symbol-embedded string "tcp/ip", Namazu decomposes it into 3 terms, "tcp/ip", "tcp", and "ip", and registers each of them independently.

For (tcp/ip), Namazu decomposes it into 4 terms: "(tcp/ip)", "tcp/ip", "tcp", "ip" . Note that no recursive processing is done, so ((tcp/ip)) is decomposed into "((tcp/ip))", "(tcp/ip)", "tcp", "ip". The first example, (foo is bar.), is indexed as "(foo", "foo", "is", "bar.)", "bar.", "bar", so both foo and bar can be searched.
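One way to reproduce the decompositions listed above is sketched below in Python. It is a reconstruction from the examples, not Namazu's code: the names decompose and index_terms are hypothetical, a "symbol" is assumed to be any character other than an ASCII letter or digit, and the rule (keep the token, peel at most one symbol from each end, then split on all symbols) is an assumption chosen because it matches every example in the text.

```python
import re


def decompose(token):
    """Return the index terms for one whitespace-separated token."""
    candidates = [token]
    # Single pass, no recursion: peel at most one symbol from each end.
    peeled = re.sub(r"^[^0-9A-Za-z]", "", token)
    peeled = re.sub(r"[^0-9A-Za-z]$", "", peeled)
    candidates.append(peeled)
    # Finally, the pieces left after splitting on every run of symbols.
    candidates.extend(re.split(r"[^0-9A-Za-z]+", token))
    terms = []
    for term in candidates:
        if term and term not in terms:  # drop empties and duplicates, keep order
            terms.append(term)
    return terms


def index_terms(sentence):
    terms = []
    for token in sentence.split():
        terms.extend(decompose(token))
    return terms
```

With this sketch, decompose("((tcp/ip))") returns "((tcp/ip))", "(tcp/ip)", "tcp", "ip", and index_terms("(foo is bar.)") returns "(foo", "foo", "is", "bar.)", "bar.", "bar".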

(Pseudo) Phrase searching

A straightforward implementation of phrase searching would lead to an unacceptably large index. Namazu converts words into hash values to reduce the index size.

If the search expression is given as a phrase "foo bar", Namazu first performs an AND search for "foo" and "bar", and then filters the results using the phrase information.

The phrase information is recorded in units of 2 words, each unit as a 16-bit hash value. For this reason, phrases of more than 2 words cannot be searched accurately. For the phrase search "foo bar baz", a document that merely contains "foo bar" and "bar baz" in separate places is also retrieved:

...
foo bar ...
... bar baz

When hash values collide, incorrect search results may be returned. Even so, every mistakenly retrieved document at least contains all of the words foo, bar, and baz.
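The behaviour described in this section can be illustrated with a small Python sketch. It is not Namazu's index format: the functions pair_hash, build_index, and phrase_search are hypothetical, and the 16-bit hash is assumed to be a truncated CRC-32, chosen only because it is deterministic and available in the standard library.

```python
import zlib


def pair_hash(first, second):
    # Assumption: any 16-bit hash will do for illustration; CRC-32 is truncated here.
    return zlib.crc32(f"{first} {second}".encode("utf-8")) & 0xFFFF


def build_index(docs):
    """Map each document ID to (set of words, set of 2-word hash values)."""
    index = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        pairs = {pair_hash(a, b) for a, b in zip(words, words[1:])}
        index[doc_id] = (set(words), pairs)
    return index


def phrase_search(index, phrase):
    words = phrase.lower().split()
    needed = {pair_hash(a, b) for a, b in zip(words, words[1:])}
    hits = []
    for doc_id, (doc_words, doc_pairs) in index.items():
        # Step 1: AND search on the words. Step 2: filter by 2-word hashes.
        if set(words) <= doc_words and needed <= doc_pairs:
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "foo bar baz",
    2: "foo bar qux and then bar baz",
}
index = build_index(docs)
results = phrase_search(index, "foo bar baz")
```

Here document 2 is returned alongside document 1 even though it never contains the three words in a row, because both 2-word units "foo bar" and "bar baz" are present. A hash collision could admit further documents, but only ones that passed the AND search first.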

Updating index for updated documents and/or deleted documents

An update does not modify or delete the document information already in the index; instead, it records which documents have been deleted. In other words, the original index is left intact, and the IDs of deleted documents are simply recorded in addition to it.

If an index is repeatedly updated for deleted or changed documents, the deleted-document information keeps growing, and the index is stored less and less efficiently. In this case, we recommend running gcnmz to clean up the garbage.
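The idea of recording deletions instead of rewriting the index can be sketched in Python as follows. The class TombstoneIndex and all of its methods are hypothetical and much simpler than Namazu's on-disk index; collect_garbage only stands in for the kind of cleanup that gcnmz is described as performing.

```python
class TombstoneIndex:
    def __init__(self):
        self.postings = {}    # word -> list of document IDs (append-only)
        self.current = {}     # path -> ID of the live version of that document
        self.deleted = set()  # IDs of documents recorded as deleted
        self._next_id = 0

    def add(self, path, text):
        # An update records the old ID as deleted, then appends the new version.
        self.remove(path)
        doc_id = self._next_id
        self._next_id += 1
        self.current[path] = doc_id
        for word in set(text.lower().split()):
            self.postings.setdefault(word, []).append(doc_id)
        return doc_id

    def remove(self, path):
        old_id = self.current.pop(path, None)
        if old_id is not None:
            self.deleted.add(old_id)  # existing postings are left untouched

    def search(self, word):
        ids = self.postings.get(word.lower(), [])
        return [d for d in ids if d not in self.deleted]

    def stale_entries(self):
        # Number of posting entries that point at deleted documents (the garbage).
        return sum(d in self.deleted for ids in self.postings.values() for d in ids)

    def collect_garbage(self):
        for word in list(self.postings):
            live = [d for d in self.postings[word] if d not in self.deleted]
            if live:
                self.postings[word] = live
            else:
                del self.postings[word]
        self.deleted.clear()
```

Each update or deletion leaves stale entries behind, so stale_entries() grows until collect_garbage() rewrites the postings without them.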


Namazu Homepage

$Id: tips.html.en,v 1.14 2006/10/21 06:26:08 opengl2772 Exp $
developers@namazu.org