Improved Verity Spider Workflow


The Verity Spider follows a distinct workflow during an indexing job. The following diagram illustrates the steps involved in a new indexing job. Note that this improved flow is valid only for new indexing jobs; restarted and refresh jobs follow a shorter flow.

When Verity Spider parses links or follows directories, any new links it finds are passed back up to be subjected to the gathering criteria. During a restarted indexing job, Verity Spider starts by looking in the persistent store for URLs that were queued for downloading but never downloaded. Documents are downloaded from Web sites into a local disk cache, and the cached copies are deleted after they are indexed.
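The loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not Verity Spider's actual implementation: the function name run_job, the callbacks fetch, extract_links, accept, and index, and the dictionary standing in for the persistent store are all hypothetical names introduced for this example.

```python
import os
import tempfile
from collections import deque


def run_job(start_urls, fetch, extract_links, accept, index, store):
    """Hypothetical gather/download/index loop.

    store maps URL -> "queued" or "indexed" and stands in for the
    persistent store; accept() stands in for the gathering criteria.
    """
    # A restarted job begins with URLs that were queued but never downloaded.
    queue = deque(url for url, state in store.items() if state == "queued")

    # Starting points are also subjected to the gathering criteria.
    for url in start_urls:
        if url not in store and accept(url):
            store[url] = "queued"
            queue.append(url)

    with tempfile.TemporaryDirectory() as cache_dir:
        cached = os.path.join(cache_dir, "doc.tmp")
        while queue:
            url = queue.popleft()

            # Download the document into the local disk cache.
            with open(cached, "w", encoding="utf-8") as f:
                f.write(fetch(url))

            # Index from the cached copy, then delete that copy.
            with open(cached, encoding="utf-8") as f:
                text = f.read()
            index(url, text)
            os.remove(cached)
            store[url] = "indexed"

            # New links are passed back up to the gathering criteria.
            for link in extract_links(text):
                if link not in store and accept(link):
                    store[link] = "queued"
                    queue.append(link)
    return store
```

In this sketch, a new job simply starts with an empty store, while a restarted job passes in the store left behind by the interrupted run, so its queued entries are downloaded first.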

Updating Collections

Verity's collection architecture supports incremental indexing, so only new, updated, or deleted documents are processed instead of every document being re-indexed each time. The architecture also supports online indexing, so searches can continue during collection maintenance and updates. This eliminates downtime caused by indexing updates.
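As a rough illustration of the incremental idea, the following Python sketch compares what a collection already holds against what is currently available and returns only the work that is needed. It is an assumption-laden example, not Verity's algorithm: the function name plan_update and the use of a per-document modification stamp to detect changes are hypothetical.

```python
def plan_update(indexed, current):
    """Return the documents that need work in an incremental update.

    indexed: dict mapping document key -> stamp recorded at last indexing
    current: dict mapping document key -> stamp observed now
    (the stamps are an assumed change-detection mechanism)
    """
    new = sorted(k for k in current if k not in indexed)
    updated = sorted(
        k for k in current if k in indexed and current[k] != indexed[k]
    )
    deleted = sorted(k for k in indexed if k not in current)
    return {"insert": new, "update": updated, "delete": deleted}


plan = plan_update(
    indexed={"a.html": 1, "b.html": 1, "c.html": 1},
    current={"a.html": 1, "b.html": 2, "d.html": 1},
)
```

Here a.html is unchanged and is left alone, b.html is updated, d.html is inserted, and c.html is deleted from the collection.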

When you search a collection while it is being updated, you may see fewer documents than before the update. This can happen because Verity Spider has removed documents that can no longer be found, that is, documents whose URLs now return "404 File Not Found."

Warning! The Verity Spider V3.6 cannot update meta collections. See "Verity Spider V3.1 and GUI Spider Collections" later in this chapter.





Copyright © 1998, Verity, Inc. All rights reserved.