Limiting the Spider


By default, the Verity Spider is not restricted in the links it follows (during web crawling) or in the directory structures it walks (during directory walking). Web crawling starts at a specified URL and follows links regardless of the directory hierarchy implied by the URL. Directory walking starts at a named directory and walks through the subdirectories it finds.

During Web Crawling

Using the -include option, you can limit the Verity Spider's web crawling behavior. If you start the indexing task at a URL without any limitation, the Verity Spider follows links anywhere, including to locations "above" the starting directory. For example, from this starting URL:

http://www.some.web.site/region2/sales/

the Verity Spider will follow links to

http://www.some.web.site/

To limit the scope of spidering to the starting directory, you can specify an include pattern, as follows:

'*/region2/sales/*'

To create a new collection while limiting indexing to the starting directory, use the following Verity Spider syntax:


vspider -collection mycoll.col
-start http://www.some.site.com/region2/sales
-include '*/region2/sales/*'

NOTE: For a more flexible approach, try using -indinclude. For more information, see Chapter 2, "Verity Spider Reference."
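
The effect of an include pattern such as '*/region2/sales/*' can be approximated outside the Verity Spider. The following Python sketch is only an illustration: it assumes the pattern behaves like a shell-style wildcard matched against the whole URL, and it uses Python's fnmatch module as a stand-in for the Verity Spider's own matching, which may differ. The function name is_included and the list of candidate URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumption: -include behaves like a shell-style wildcard applied to the
# whole URL. This approximates, but is not, the Verity Spider's own matcher.
INCLUDE_PATTERN = "*/region2/sales/*"


def is_included(url: str, pattern: str = INCLUDE_PATTERN) -> bool:
    """Return True if the URL matches the include pattern (case-sensitive)."""
    return fnmatchcase(url, pattern)


# Hypothetical links discovered while crawling from the starting URL.
candidates = [
    "http://www.some.web.site/region2/sales/",
    "http://www.some.web.site/region2/sales/q1/report.html",
    "http://www.some.web.site/region2/",
    "http://www.some.web.site/",
]

# Only URLs at or below /region2/sales/ pass the filter.
included = [url for url in candidates if is_included(url)]

for url in included:
    print(url)
```

Under these assumptions, links that lead "above" the starting directory are filtered out, while pages at or below /region2/sales/ are kept.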

During Directory Walking

Using the -prunedir option, you can specify directories to be skipped by the Verity Spider when it does directory walking. This option takes one or more regular expression patterns.

The Verity Spider uses C-shell-style regular expressions, not grep-style ones. For example, the following is a valid expression:

-prunedir 'foo/ba[rz]'

The following expression uses grep-style alternation, so it is invalid and has no effect:

-prunedir 'foo(bar|baz)'

A trailing slash is also invalid:

-prunedir 'foo/bar/'

You can use multiple -prunedir expressions, as follows:

-prunedir 'foo/bar' -prunedir 'x/y'
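
The valid and invalid expressions above can be illustrated with a small Python sketch. It is an approximation under stated assumptions, not the Verity Spider's implementation: Python's fnmatch module stands in for C-shell-style matching, and each pattern is assumed to be compared against a whole directory path relative to the starting directory. The function name is_pruned and the sample directory names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_pruned(directory: str, patterns: Iterable[str]) -> bool:
    """Return True if any prune pattern matches the whole directory path."""
    return any(fnmatchcase(directory, pattern) for pattern in patterns)


# Hypothetical directory paths, relative to the walk's starting directory.
directories = ["foo/bar", "foo/baz", "foo/bat", "x/y", "docs"]

# Bracketed character set: matches foo/bar and foo/baz only.
pruned_by_set = [d for d in directories if is_pruned(d, ["foo/ba[rz]"])]

# grep-style alternation is taken literally here, so nothing is pruned.
pruned_by_alternation = [d for d in directories if is_pruned(d, ["foo(bar|baz)"])]

# A trailing slash never matches a path that has no trailing slash.
pruned_by_trailing_slash = [d for d in directories if is_pruned(d, ["foo/bar/"])]

# Several patterns, as with repeated -prunedir options.
pruned_by_multiple = [d for d in directories if is_pruned(d, ["foo/bar", "x/y"])]

print(pruned_by_set)
print(pruned_by_alternation)
print(pruned_by_trailing_slash)
print(pruned_by_multiple)
```

In this sketch, the alternation and trailing-slash patterns match none of the sample directories, which mirrors the "no effect" behavior described above.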

Copyright © 1998, Verity, Inc. All rights reserved.