Indexing Workflow


The following is an example indexing scenario, from starting an indexing job through updating and refreshing the collection. The assumptions for this workflow scenario are:

Keep in mind that there are numerous other options available for your indexing jobs. For this scenario, the indexing commands are kept simple to illustrate the key points of indexing only certain documents and following a collection through a likely life cycle.

NOTE: All Verity Spider commands must be issued as a single line from the command line. They are broken up here for readability.
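In a POSIX-style shell, one way to keep a long command readable in a script while still issuing it as a single line is to end each line with a backslash; the shell joins the lines before the program sees them. The following is a minimal sketch under that assumption. It uses the option values from Step 1, and echo stands in for actually running vspider. The function name show_step1_command is hypothetical.

```shell
#!/bin/sh
# Sketch: the trailing backslashes make the shell treat these lines as one
# command line. "echo" stands in for running vspider, so nothing is indexed.
# The URL pattern is in single quotes so the shell does not expand the *.
show_step1_command() {
    echo vspider -collection /host1/usr/verity/collections/internal.coll \
        -start http://web.verity.com/docs/index.htm \
        -domain verity.com \
        -mimeinclude text/html -exclude 'http://www.verity.com/*' \
        -jumps 5
}

show_step1_command
```

To run the job rather than print it, remove the leading echo. Nothing may follow a backslash on its line, not even a space.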

Step 1 - Start the Indexing Job

For your first indexing job you want to create a new collection into which you will index only HTML documents from your company's intranet while excluding files from your publicly accessible web server.


vspider -collection /host1/usr/verity/collections/internal.coll
-start http://web.verity.com/docs/index.htm
-domain verity.com
-mimeinclude text/html -exclude 'http://www.verity.com/*'
-jumps 5

Restarting

If this job should fail or halt for some reason, you can restart it with:


vspider -collection /host1/usr/verity/collections/internal.coll
-restart
-domain verity.com
-mimeinclude text/html -exclude 'http://www.verity.com/*'
-jumps 5

When you restart an indexing job, omit any -start entries and use the -restart option instead. You must still include the other options, such as the inclusion criteria.
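The restart pattern described above can be wrapped in a small script. This is a minimal sketch, assuming a POSIX shell and assuming that vspider reports failure through a nonzero exit status, which the text here does not confirm. The VSPIDER variable and the index_with_restart function are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: run an indexing job and, if it exits with a nonzero status,
# rerun it once with -restart in place of the -start entry.
# ASSUMPTION: vspider signals failure through its exit status.
# VSPIDER is a hypothetical override so the script can be tried with a stub.
VSPIDER=${VSPIDER:-vspider}
COLL=/host1/usr/verity/collections/internal.coll

index_with_restart() {
    # "$@" holds the options shared by both runs (inclusion criteria etc.).
    if "$VSPIDER" -collection "$COLL" \
        -start http://web.verity.com/docs/index.htm "$@"; then
        return 0
    fi
    echo "indexing job failed; restarting" >&2
    "$VSPIDER" -collection "$COLL" -restart "$@"
}
```

A call might look like: index_with_restart -domain verity.com -mimeinclude text/html -exclude 'http://www.verity.com/*' -jumps 5. Passing the shared options once keeps the first run and the restart consistent.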

Step 2 - Import the Collection into Information Server

Now that you have created the collection, you must import it into Information Server so it can be searched.

Use the Admin interface for Information Server to import the collection.

Step 3 - Refresh the Collection

The internal web sites change frequently because they contain status and work-in-progress information prepared by all employees. To refresh the HTML documents in your collection, run the command:


vspider -collection /host1/usr/verity/collections/internal.coll
-refresh
-refreshtime 4 hours
-domain verity.com
-mimeinclude text/html -exclude 'http://www.verity.com/*'
-jumps 5

NOTE: You will likely want to run this indexing job often. The -refreshtime option lets you pass over documents that were indexed recently (here, within the last 4 hours), because you know the documents change often, but not constantly.

Step 4 - Update the Collection with Word Documents

With your collection in place, you now want to add all Word documents from your file server.


vspider -collection /host1/usr/verity/collections/internal.coll
-start /usr/docs/
-indmimeinclude application/msword

Restarting

If this job should fail or halt for some reason, you can restart it with:


vspider -collection /host1/usr/verity/collections/internal.coll
-restart
-indmimeinclude application/msword

Step 5 - Move the Collection to Another Server

You find that the internal.coll collection has become very popular for searching, and requires refreshing every four hours and one second. To alleviate the strain on the current server, you decide to move the collection to a larger server running a different operating system. The instructions that follow assume that the directories are linked via NFS. To accomplish the move, do the following:

1. Copy the collection directory structure. According to the assumptions, you would issue the following command at a shell prompt:

% cp -rp /host1/usr/verity/collections/internal.coll /host2/usr/verity/colls

After you have copied the collection to the new server, you may want to delete the original directory structure on the original server.

2. With the Admin interface for Information Server running on Caliber, import the collection internal.coll.
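Before deleting the original directory structure mentioned in item 1, it can be worth confirming that the copy is complete. The following is a minimal sketch, assuming a POSIX shell with the standard cp, diff, and basename utilities. The copy_collection function is a hypothetical helper, not a Verity tool, and the paths in its comments are the ones from this scenario.

```shell
#!/bin/sh
# Sketch: copy a collection directory, then compare the two trees.
# copy_collection is a hypothetical helper, not part of Verity Spider.
copy_collection() {
    src=$1          # e.g. /host1/usr/verity/collections/internal.coll
    dest_parent=$2  # e.g. /host2/usr/verity/colls (must already exist)

    cp -rp "$src" "$dest_parent" || return 1

    # diff -r exits nonzero if any file differs or is missing on either side.
    if diff -r "$src" "$dest_parent/$(basename "$src")" >/dev/null; then
        echo "copy verified"
    else
        echo "copy differs from original; keep the original" >&2
        return 1
    fi
}
```

Delete the original only after the function prints "copy verified". The comparison reads both trees in full, so it may take a while on a large collection.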

Synchronizing the Collection

Now that the collection is on Caliber, synchronize the collection to recreate the persistent store for the new operating system platform. This is only necessary if the operating systems on the two platforms differ, which they do for this scenario.


vspider -collection /host2/usr/verity/colls/internal.coll
-resync

Step 6 - Refresh the Collection

To update all documents in the internal.coll collection now on the new server, run the command:


vspider -collection /host2/usr/verity/colls/internal.coll
-refresh -refreshtime 4 hours
-domain verity.com
-mimeinclude text/html -exclude 'http://www.verity.com/*'
-mimeinclude application/msword
-jumps 5

You will likely want to continue to refresh only the HTML documents. Run the same command as above, but omit -mimeinclude application/msword to avoid also picking up the Word documents.
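The two variants of the refresh command can be kept in one place so that they do not drift apart. The following is a minimal sketch, assuming a POSIX shell. The refresh_command function and its "html" and "all" modes are hypothetical names introduced for illustration, and the function only prints the command rather than running vspider.

```shell
#!/bin/sh
# Sketch: print the Step 6 refresh command, with or without Word documents.
# refresh_command is a hypothetical helper; it only echoes the command.
refresh_command() {
    mode=${1:-html}   # "html" (default) or "all"
    word_opts=""
    if [ "$mode" = "all" ]; then
        word_opts="-mimeinclude application/msword"
    fi
    # $word_opts is left unquoted on purpose so that it splits into two
    # words, or disappears entirely when empty.
    echo vspider -collection /host2/usr/verity/colls/internal.coll \
        -refresh -refreshtime 4 hours \
        -domain verity.com \
        -mimeinclude text/html -exclude 'http://www.verity.com/*' \
        $word_opts \
        -jumps 5
}
```

Calling refresh_command with no argument prints the HTML-only command, and refresh_command all adds the Word documents. Because echo drops the quotes around the URL pattern, put them back if you paste the printed command into a shell.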





Copyright © 1998, Verity, Inc. All rights reserved.