File Systems


You want to spider a network drive to index all Microsoft Word and ASCII text documents, skipping all directories named TEMP, CONFIDENTIAL, and ACCOUNTING, as well as any directories more than 4 subdirectories deep.

vspider -cmdfile c:\verity\vspider\files.cmd

where files.cmd consists of:

-collection icd.coll
-start f:\documents
-indmimeinclude application/msword -indmimeinclude 'text/plain'
-prunedir 'TEM*' 'CONFIDENTI*' 'ACCOUNT*' -pathlen 6
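
The wildcard patterns passed to -prunedir can match more than the three directory names in the scenario. The Python sketch below illustrates this, assuming the patterns behave like case-sensitive shell-style wildcards. Python's fnmatch module stands in for the Spider's own matcher, which may differ, and the helper name is_pruned and the sample directory names are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the -prunedir line in files.cmd.
PRUNE_PATTERNS = ["TEM*", "CONFIDENTI*", "ACCOUNT*"]

def is_pruned(dirname, patterns=PRUNE_PATTERNS):
    """Return True if dirname matches any prune pattern (hypothetical helper).

    Assumption: shell-style, case-sensitive wildcard matching.
    """
    return any(fnmatchcase(dirname, p) for p in patterns)

# Sample directory names are made up for illustration.
for name in ["TEMP", "TEMPLATES", "CONFIDENTIAL", "ACCOUNTING", "REPORTS"]:
    print(name, is_pruned(name))
```

Under that assumption, a directory such as TEMPLATES would be skipped along with TEMP, so broad patterns are worth checking against the real directory tree before a long indexing run.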

If you find that MIME types are being dropped, or you know you will be indexing files whose extensions are not known to the Verity Spider by default, use the wildcard expression '*/*' for your MIME criteria.

For example:

-mimeinclude '*/*'

Remember, on UNIX you need to enclose values that include wildcard characters in single quotes.

You should also use inclusion and exclusion criteria to fine-tune what is indexed.

Case-specific Options

Option
Reason
-indmimeinclude
This option specifies that only the specified MIME types are to be indexed. Although other files may be gathered, such as HTML pages which contain links to the desired document types, only those types which are indicated are actually indexed.

In this case, a second instance of -indmimeinclude is used to index the second MIME type. Alternatively, you could include all values in a single instance of -indmimeinclude.

-prunedir, -pathlen
These options allow you to control what is followed for gathering.

Specifically, -pathlen must be 6 because you must account for the starting path segment, the 4 directories below the starting directory, and a filename (total of 6 pathlen units).
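
The arithmetic behind -pathlen 6 can be sketched in Python. This only illustrates the counting rule described above (one unit for the starting path, one per subdirectory level, one for the filename). The function names and sample file paths are hypothetical, and this is not the Spider's own implementation.

```python
from pathlib import PureWindowsPath

def pathlen_for_depth(max_subdirs):
    """pathlen units = starting path segment + subdirectory levels + filename."""
    return 1 + max_subdirs + 1

def units_used(start, file_path):
    """Count pathlen units for a file under start, using the same rule."""
    rel = PureWindowsPath(file_path).relative_to(PureWindowsPath(start))
    # rel.parts holds the subdirectories below start plus the filename.
    return 1 + len(rel.parts)

# Value needed to allow 4 subdirectory levels, as in the scenario.
print(pathlen_for_depth(4))

# A file 4 levels down fits within -pathlen 6; one 5 levels down does not.
print(units_used(r"f:\documents", r"f:\documents\a\b\c\d\report.doc"))
print(units_used(r"f:\documents", r"f:\documents\a\b\c\d\e\report.doc"))
```

By this count, allowing one more or one fewer subdirectory level changes the -pathlen value by exactly one.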

Unnecessary Options for this Case

Option
Reason
-indmimeexclude
By default, the Verity Spider will only index those document types specified with -indmimeinclude, automatically ignoring all others. Therefore it is not necessary to explicitly specify that the text/html MIME type be excluded.

All Networking Options, -cgiok, -norobo
These options only affect the indexing of web sites.

Copyright © 1998, Verity, Inc. All rights reserved.