The Use of Last-Modified Date


When crawling the web to gather HTML documents for indexing, Verity Spider looks for the date each document was last modified, in the form of a field named Last-Modified. The value of Last-Modified determines whether a document should be indexed again.

How Last-Modified is Used

For HTML documents which have been indexed, and for which a value exists in the last_modified_date field in the persistent store, Verity Spider compares the retrieved document's Last-Modified value with last_modified_date. What happens to the document depends on the outcome of the comparison.

You can see the last_modified_date value for documents already in a collection's persistent store by running the report utility, vsdb, with the -date option. For more information, see "Verity Spider Reporting" earlier in this chapter.

New Documents

For HTML documents that have never been indexed, the value for Last-Modified is irrelevant. The Last-Modified date, if it exists, is stored in the last_modified_date field of the persistent store for the collection into which the document is being indexed.

When there is no last_modified_date value in the persistent store, Verity Spider has nothing against which to compare future versions of the document, and so the document will be indexed the next time it is encountered.

NOTE: A workaround is to use -refreshtime to ensure that documents just recently indexed are not indexed again. For more information, see Chapter 2, "Verity Spider Reference."
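
The indexing decision described in this section can be sketched in Python as follows. This is an illustration only, not Verity Spider's implementation: the function name and parameters are hypothetical, and the rule that a newer Last-Modified value triggers re-indexing is an assumption, since this section does not spell out the result of the comparison.

```python
from datetime import datetime, timezone
from typing import Optional


def should_index(already_indexed: bool,
                 retrieved: Optional[datetime],
                 stored: Optional[datetime]) -> bool:
    """Decide whether a document should be (re)indexed.

    retrieved: Last-Modified value of the document just fetched, if any.
    stored:    last_modified_date from the persistent store, if any.
    """
    if not already_indexed:
        # New documents are always indexed; Last-Modified is irrelevant.
        return True
    if stored is None or retrieved is None:
        # Nothing to compare against, so the document is indexed again.
        return True
    # ASSUMPTION: a newer Last-Modified value means the document changed.
    return retrieved > stored


if __name__ == "__main__":
    stored = datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc)
    newer = datetime(1994, 11, 7, 0, 0, 0, tzinfo=timezone.utc)
    print(should_index(True, newer, stored))   # document changed
    print(should_index(True, stored, stored))  # document unchanged
```

The second branch is the case that -refreshtime works around: without a stored date, every encounter leads to indexing.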

Dynamic Documents

If you are dealing with dynamically generated HTML documents, then there may never be a Last-Modified date and so the document may always be indexed. A workaround is to incorporate a meta tag into the processing of the dynamic documents and take advantage of the -metafile option. See "Using a Custom Last-Modified Value" below.

How Last-Modified is Determined

When indexing web sites, Verity Spider reads the Last-Modified HTTP header field. The value of Last-Modified is normally provided by the web server from which the document is served. In some cases, though, the web server may be configured not to provide a Last-Modified value. Verity Spider can recognize only the standard HTTP header field, in the form of Last-Modified.

If you are working with web servers that do not provide this field, or you are indexing dynamically generated files, or you simply want to use your own value, then you must provide a meta tag in the documents themselves and create a file that maps your meta tag to Last-Modified. For more information, see "Using a Custom Last-Modified Value" below.
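
To find out whether a given server supplies the header at all, you can inspect the response headers yourself. The following Python sketch is not part of Verity Spider; the function name is hypothetical, and it assumes you have already collected the response headers into a mapping. It returns the parsed Last-Modified date, or None when the header is missing or unparsable.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified_from_headers(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if unavailable."""
    for name, value in headers.items():
        # HTTP header field names are case-insensitive.
        if name.lower() == "last-modified":
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                # Header present, but not in a recognizable date format.
                return None
    return None


if __name__ == "__main__":
    static_page = {"Content-Type": "text/html",
                   "Last-Modified": "Sun, 06 Nov 1994 08:49:37 GMT"}
    dynamic_page = {"Content-Type": "text/html"}
    print(last_modified_from_headers(static_page))
    print(last_modified_from_headers(dynamic_page))
```

A None result for pages you want indexed selectively is a sign that those pages need a custom value.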

Using a Custom Last-Modified Value

Two scenarios involve using a custom value for Last-Modified:

Providing a Value for Last-Modified

When a web server delivers documents without a Last-Modified value, whether because of the web server configuration or because the documents are generated dynamically, you can provide Verity Spider with a value. For more information, review the "Example" below.

Overriding an Existing Last-Modified Value

When indexing web sites, you may want to use your own date/time values to specify when a document was last modified. Review the "Example" below, and keep in mind that you must specify the "Y" override flag in your map file to ensure that your value is always used.
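
The effect of the override flag can be sketched in Python as follows. This only models the documented precedence rules and is not Verity Spider code; the function and parameter names are hypothetical. Falling back to the header value when the flag is Y but the document has no meta tag is an assumption.

```python
from typing import Optional


def effective_last_modified(header_value: Optional[str],
                            meta_value: Optional[str],
                            override: str) -> Optional[str]:
    """Pick the Last-Modified value according to the map file's override flag."""
    if override not in ("Y", "N"):
        raise ValueError("override flag must be 'Y' or 'N'")
    if override == "Y" and meta_value is not None:
        # Y: the custom meta tag value wins even if the header is present.
        return meta_value
    if header_value is not None:
        # N (or no meta tag, by assumption): the server's header is used.
        return header_value
    # N with no header: the custom meta tag value fills the gap.
    return meta_value


if __name__ == "__main__":
    header = "Sun, 06 Nov 1994 08:49:37 GMT"
    custom = "Mon, 07 Nov 1994 10:00:00 GMT"
    print(effective_last_modified(header, custom, "Y"))  # custom value
    print(effective_last_modified(header, custom, "N"))  # header value
    print(effective_last_modified(None, custom, "N"))    # custom value
```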

Example

To incorporate a custom value for Last-Modified when indexing web sites, do the following:

1. Add a meta tag to all HTML documents for which you want to assign your own value for Last-Modified. Meta tags use the syntax:

<meta name="name" content="content">

where name is any string that you will later use in the text map file, and content is the date and time you want to use for Last-Modified. The value for content must be in one of the following date formats:

Date format                       Example
RFC822 (updated by RFC 1123)      Sun, 06 Nov 1994 08:49:37 GMT
RFC850 (obsoleted by RFC 1036)    Sunday, 06-Nov-94 08:49:37 GMT
ANSI C's asctime() format         Sun Nov  6 08:49:37 1994

Warning! In the asctime() format, the day value must occupy two character positions. If the day has only one digit, as in the example, then you must provide an extra space between the month and the digit.

For dynamic documents, such as those generated by database middleware or scripts, you will have to find a way to incorporate the meta tag into the processing of the document.

NOTE: In order to provide or override the Last-Modified value, a meta tag must exist in the HTML documents themselves. If you do not have any control over the web server or documents in question, you can still exercise some control over when documents are indexed by using -refreshtime and -refresh with inclusion or exclusion criteria. For more information on these options, see "Reference of Command-line Options" in Chapter 2.

2. Create a text file that contains an entry for mapping your custom meta tag name to the expected Last-Modified field name. The syntax for the entry is:

name Last-Modified Y|N

where Y|N is the override flag: specify either Y (yes) or N (no).

If you use the Y override flag, the value for the custom meta tag overrides the value for Last-Modified, even if both values are present and differ. If you use the N override flag, then the value for the custom meta tag will be used only if there is no value for Last-Modified. If a value for Last-Modified exists, then that is given precedence.

3. Run a Verity Spider indexing job with the -metafile option.
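
For dynamically generated documents, steps 1 and 2 can be scripted. The Python sketch below is one possible approach, not something supplied with Verity Spider: the helper names and the meta tag name doc-modified are hypothetical. It produces the three accepted date formats for a given time, builds the meta tag, and builds the matching map file entry.

```python
import calendar
import time
from email.utils import formatdate
from html import escape


def accepted_date_formats(epoch_seconds: int) -> dict:
    """Render a UTC timestamp in the three accepted date formats."""
    t = time.gmtime(epoch_seconds)
    return {
        # RFC822 as updated by RFC 1123
        "rfc1123": formatdate(epoch_seconds, usegmt=True),
        # RFC850; day and month names assume the default C (English) locale
        "rfc850": time.strftime("%A, %d-%b-%y %H:%M:%S GMT", t),
        # asctime(); a one-digit day is padded with a space to two positions
        "asctime": time.asctime(t),
    }


def meta_tag(name: str, content: str) -> str:
    """Build the meta tag to embed in the HTML document (step 1)."""
    return '<meta name="{}" content="{}">'.format(escape(name), escape(content))


def map_entry(name: str, override: str) -> str:
    """Build the map file entry for the meta tag name (step 2)."""
    if override not in ("Y", "N"):
        raise ValueError("override flag must be 'Y' or 'N'")
    return "{} Last-Modified {}".format(name, override)


if __name__ == "__main__":
    when = calendar.timegm((1994, 11, 6, 8, 49, 37))
    dates = accepted_date_formats(when)
    # "doc-modified" is a hypothetical meta tag name chosen for this example.
    print(meta_tag("doc-modified", dates["rfc1123"]))
    print(map_entry("doc-modified", "Y"))
```

Saving the printed map entry in a text file and supplying that file through the -metafile option corresponds to steps 2 and 3.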





Copyright © 1998, Verity, Inc. All rights reserved.