Verity Spider Features


The Verity Spider lets you index Web-based and file system documents throughout the enterprise. It works in conjunction with Verity's KeyView document filtering technology, so that more than two hundred of the most popular application document formats can be indexed, including Office 97, WordPerfect, ASCII text, HTML, SGML, and PDF (Adobe Acrobat) documents.

Support for Multiple Platforms

The Verity Spider is supported on multiple platforms, including Windows NT Server and Solaris. For a complete list of the supported platforms with version numbers, see the SEARCH’97 Information Server Product Notes.

NOTE: In this release, the Verity Spider will run on Solaris V2.51 only under the following circumstances:

Supports Web Standards

Verity Spider supports key Web standards used by Internet and intranet sites today. Standard HREF links and frame pointers are recognized, so the spider can navigate through them. Redirected pages are followed so that the real underlying document is indexed. Verity Spider adheres to the robots exclusion standard specified in robots.txt files, so that administrators can keep visits to remote Web sites friendly. The HTTP Basic Authentication mechanism is supported so that password-protected sites can be indexed.
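The robots exclusion check described above can be illustrated with a short Python sketch. This is not Verity Spider code; it uses Python's standard urllib.robotparser module, and the robots.txt content, the agent name "example-spider", and the example.com URLs are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the root of each site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, agent, url):
    """Return True if the rules allow this agent to request the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "http://www.example.com/docs/index.html",
        "http://www.example.com/private/plan.html",
    ):
        print(url, may_fetch(rules, "example-spider", url))
```

With these assumed rules, a well-behaved crawler would request the first URL and skip the second, because the second falls under the disallowed /private/ path.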

Unlike other Web crawlers, Verity Spider does not need to maintain local copies of remote documents. Verity indexes store pointers to the original documents, so no disk space is spent on document copies. When documents are viewed through Verity Information Server, they are read from their native location, with optional highlighting.
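The idea of an index that holds pointers rather than document copies can be shown with a minimal Python sketch. It is a generic inverted index, not the Verity index format; the function names and sample URLs are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of URLs whose text contains it.

    Only the URL (the pointer) is kept; the document text is discarded
    once its terms have been recorded.
    """
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, term):
    """Return the URLs that contain the term, in sorted order."""
    return sorted(index.get(term.lower(), set()))


if __name__ == "__main__":
    # Hypothetical documents keyed by their original location.
    docs = {
        "http://www.example.com/a.html": "Spider indexing overview",
        "http://www.example.com/b.html": "Proxy settings for the spider",
    }
    print(search(build_index(docs), "spider"))
```

A viewer built on such an index would fetch the document from the returned URL at display time, which is the behavior the paragraph above describes.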

Indexing Other Data Sources

In addition to indexing standard application document formats through Web servers and the file system, Verity Spider can also index other data sources.

ODBC-compliant Databases

A separate product, Verity Database Access, is available for indexing and searching information stored in ODBC-compliant databases. Contact your Verity sales representative for more information.

Microsoft Exchange

A separate product, Verity Information Server for Microsoft Exchange, is available for indexing and searching Exchange folders and mail. Contact your Verity sales representative for more information.

Unsupported Data Sources

Verity Spider V3.6 does not support indexing of the following, although support may be added for these and other data sources in the future.

Restart Options

When an indexing job fails, or when the Verity Spider cannot index a significant number of URLs or URLs of a particular type, you can now restart the indexing job to update the collection without re-running the entire job.

State Maintenance Through a Persistent Store

Verity Spider V3.6 stores the state of URLs, both gathered and indexed, in a persistent store. This allows it to track progress so that halted indexing jobs can be restarted gracefully and efficiently.

Previous versions of Verity Spider held state information in memory, which meant that any stoppage of spidering resulted in lost work. It also meant that larger target sites required significantly more memory for spidering. The information in the persistent store can also be used to report statistics such as the number of indexed pages, visited pages, rejected pages, and broken links.
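The following Python sketch shows, in general terms, how a persistent URL state store supports restarts and reporting. It does not reflect the actual Verity Spider store format. The use of SQLite, the table layout, and the state names "gathered", "indexed", "rejected", and "broken" are all assumptions chosen for the illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the URL state store. Pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_state ("
        "url TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def record(conn, url, state):
    """Insert or update the state of one URL and commit immediately."""
    conn.execute(
        "INSERT OR REPLACE INTO url_state (url, state) VALUES (?, ?)",
        (url, state),
    )
    conn.commit()


def pending(conn):
    """URLs that were gathered but not yet indexed: the work left after a halt."""
    rows = conn.execute(
        "SELECT url FROM url_state WHERE state = 'gathered' ORDER BY url"
    )
    return [row[0] for row in rows]


def summary(conn):
    """Count URLs per state, for progress reporting."""
    rows = conn.execute("SELECT state, COUNT(*) FROM url_state GROUP BY state")
    return dict(rows)


if __name__ == "__main__":
    store = open_store()
    record(store, "http://www.example.com/a.html", "gathered")
    record(store, "http://www.example.com/b.html", "gathered")
    record(store, "http://www.example.com/a.html", "indexed")
    record(store, "http://www.example.com/missing.html", "broken")
    print(pending(store))
    print(summary(store))
```

Because each state change is committed as it happens, a restarted job in this sketch only needs to process the URLs returned by pending() rather than repeating the whole crawl.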

Warning! The Verity Spider persistent store is platform dependent and cannot be copied across operating systems. If you want to copy a collection from one platform to another, you will have to perform a synchronization, using -resync, to recreate the persistent store on the new platform.

Performance Improvements

With low memory requirements and the help of multithreading and more efficient Domain Name System (DNS) lookups, spidering performance is greatly improved over previous versions.

Multithreading

Verity Spider V3.6 separates the gathering and indexing jobs into multiple threads for concurrency. It can open concurrent connections to Web servers for fetching documents and run concurrent indexing threads for maximum utilization. This translates to an overall improvement in throughput. In previous releases, work was done in a round-robin manner, so that only one job was running at any given time.
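The separation of gathering and indexing into concurrent threads can be sketched in Python with two queues. This is an illustration of the general pattern, not Verity Spider's implementation. The fetch() function is a stand-in that returns fixed text instead of contacting a Web server, and the thread counts are arbitrary.

```python
import queue
import threading


def fetch(url):
    """Stand-in for an HTTP request; returns placeholder text for the URL."""
    return f"contents of {url}"


def run_pipeline(urls, n_fetchers=3, n_indexers=2):
    """Fetch and index concurrently; return a map of URL to word count."""
    to_fetch = queue.Queue()
    to_index = queue.Queue()
    index = {}
    lock = threading.Lock()

    def fetcher():
        while True:
            url = to_fetch.get()
            if url is None:  # sentinel: no more URLs to gather
                break
            to_index.put((url, fetch(url)))

    def indexer():
        while True:
            item = to_index.get()
            if item is None:  # sentinel: no more documents to index
                break
            url, body = item
            with lock:
                index[url] = len(body.split())

    fetchers = [threading.Thread(target=fetcher) for _ in range(n_fetchers)]
    indexers = [threading.Thread(target=indexer) for _ in range(n_indexers)]
    for thread in fetchers + indexers:
        thread.start()

    for url in urls:
        to_fetch.put(url)
    for _ in fetchers:
        to_fetch.put(None)
    for thread in fetchers:
        thread.join()

    # All documents are queued once the fetchers finish, so stop the indexers.
    for _ in indexers:
        to_index.put(None)
    for thread in indexers:
        thread.join()
    return index


if __name__ == "__main__":
    print(run_pipeline([f"http://www.example.com/{n}.html" for n in range(5)]))
```

In this arrangement an indexing thread can work on one document while fetching threads wait on the network for others, in contrast to a round-robin design where only one job runs at a time.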

Efficient DNS Lookups

Verity Spider V3.6 significantly reduces DNS lookups, which greatly improves spidering throughput. If spidering is limited by domain or host, no DNS lookups are made for hosts that fall outside that range. Previously, DNS lookups were made for all candidate URLs.
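A minimal Python sketch of the filtering idea follows: candidate URLs are checked against the allowed domain first, and only hosts that pass are resolved. The function names, the per-host cache, and the stub resolver are assumptions for the illustration and do not describe Verity Spider internals.

```python
import socket
from urllib.parse import urlsplit


def in_scope(url, allowed_domain):
    """True if the URL's host is the allowed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == allowed_domain or host.endswith("." + allowed_domain)


def resolve_in_scope(urls, allowed_domain, resolver=socket.gethostbyname):
    """Resolve each in-scope host once; out-of-scope hosts are never looked up."""
    addresses = {}
    for url in urls:
        if not in_scope(url, allowed_domain):
            continue  # skipped before any DNS traffic is generated
        host = urlsplit(url).hostname
        if host not in addresses:
            addresses[host] = resolver(host)
    return addresses


if __name__ == "__main__":
    lookups = []

    def stub_resolver(host):
        """Records each lookup and returns a documentation-range address."""
        lookups.append(host)
        return "192.0.2.1"

    candidates = [
        "http://www.example.com/a.html",
        "http://www.example.com/b.html",
        "http://other.example.org/c.html",
    ]
    print(resolve_in_scope(candidates, "example.com", resolver=stub_resolver))
    print(lookups)
```

The stub resolver keeps the example free of network access; passing no resolver argument would use socket.gethostbyname for real lookups.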

Proxy Handling Improvements

Two new options allow much greater flexibility for indexing jobs that involve proxy servers and firewalls: -noproxy reduces proxy checking for certain hosts, and -proxyauth authenticates on proxy servers.





Copyright © 1998, Verity, Inc. All rights reserved.