Use cache expiry #1

Open
dazza-codes opened this issue Feb 18, 2015 · 7 comments
@dazza-codes
Owner

When storing retrieved RDF in a local cache, try to add a triple that represents any cache expiry headers from the original source of the RDF (LOC, VIAF, ISNI, OCLC, etc.). Create a cron job process that can query the local cache to extract entities with cache-control data that has expired, then use a background queue that runs processes with a 'nice' priority to retrieve and update the expired RDF.

If MongoDB is the local cache, create an index on the cache-control data. Where possible, encourage RDF providers to use cache-control headers with reasonable values that correspond to their routine update cycles (if they are routine). If an entire repository has the same expiry, it might be readily stored in only a few triples.
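The cron-job query described above could be sketched roughly as follows. This is only an illustration: the in-memory list stands in for the local cache, and the `expires_at` field name, the `expired_uris` helper, and the sample expiry values are assumptions, not part of any existing code.

```python
from datetime import datetime, timezone

# Hypothetical cache records; with MongoDB, "expires_at" would be the indexed field.
cache = [
    {"uri": "http://id.loc.gov/authorities/names/n79044798",
     "expires_at": datetime(2015, 2, 19, 16, 31, 12, tzinfo=timezone.utc)},
    {"uri": "http://viaf.org/viaf/108317368/rdf",
     "expires_at": None},  # the source supplied no cache-control data
]

def expired_uris(records, now):
    """Return the URIs whose stored expiry time is at or before `now`."""
    return [r["uri"] for r in records
            if r["expires_at"] is not None and r["expires_at"] <= now]

# A cron job would pass the current time; a fixed time keeps the sketch repeatable.
now = datetime(2015, 2, 20, 0, 0, 0, tzinfo=timezone.utc)
for uri in expired_uris(cache, now):
    print("queue for refresh:", uri)
```

Records without any expiry data are skipped here; a real process would need its own policy for those (for example, a default refresh interval).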

@dazza-codes
Owner Author

For example, the HEAD request on this LOC authority
http://id.loc.gov/authorities/names/n79044798
Response headers:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: public, max-age=43200
X-PrefLabel: Byrnes, Christopher I., 1949-
X-URI: http://id.loc.gov/authorities/names/n79044798
Server: Apache
Accept-Ranges: bytes
Date: Thu, 19 Feb 2015 04:31:12 GMT
X-Varnish: 1233935832
Age: 0
Via: 1.1 varnish
Connection: keep-alive

The useful data for caching here is Cache-Control: public, max-age=43200 and the Date: Thu, 19 Feb 2015 04:31:12 GMT. In this case, the data may be cached for 12 hours after the Date; beyond that time, the cached data may be stale.
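As a minimal sketch of that calculation, the expiry time can be derived from the two headers using only the Python standard library. The `expiry_from_headers` helper is a hypothetical name, and the sketch ignores the Age header (which is 0 in this response).

```python
import re
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime
from typing import Optional

def expiry_from_headers(headers) -> Optional[datetime]:
    """Return the time after which the cached response may be stale, or None."""
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    date_value = headers.get("Date")
    if match is None or date_value is None:
        return None  # no usable cache-control data
    fetched_at = parsedate_to_datetime(date_value)
    return fetched_at + timedelta(seconds=int(match.group(1)))

loc_headers = {
    "Cache-Control": "public, max-age=43200",
    "Date": "Thu, 19 Feb 2015 04:31:12 GMT",
}
expires_at = expiry_from_headers(loc_headers)
print(expires_at.strftime("%Y-%m-%d %H:%M:%S"))  # 2015-02-19 16:31:12
```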

In theory, the max-age should correspond to how often LOC updates this data on their servers. In practice, they don't want to manage these details, so the policy is to have the data cached only for 12 hours so that whenever it is updated, the caches will be refreshed within a day.

It may be preferable for LOC to set the Expires header with a date that corresponds to their next scheduled update for the data; see https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers#expires

Another option: if LOC sets the Last-Modified header, a conditional request can be issued using an If-Modified-Since request header that has the date the data was last cached. If the response code is 304, there is no new data available (so use the cached data); see
https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers#time-based

Another option: LOC could set the ETag (or Entity Tag) header, which works in a similar way to the Last-Modified header except its value is a digest of the resource's contents (for instance, an MD5 hash). Then a conditional request can use the If-None-Match request header with an ETag value of the cached data. If the response code is 304, there is no new data available (so use the cached data); see
https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers#content-based
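Both conditional-request options can be sketched from the client side as below. This is an illustration only: the `conditional_headers` and `use_cached_copy` helpers, the `etag` and `last_modified` keys, and the sample values are all hypothetical, and no request is actually sent.

```python
def conditional_headers(cached):
    """Build revalidation request headers from validators stored with a cached copy."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        # Reuse the exact date string the server sent earlier.
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def use_cached_copy(status_code):
    """A 304 response means the cached data is still current."""
    return status_code == 304

# Hypothetical validators saved alongside a cached RDF document.
cached = {"etag": '"abc123"', "last_modified": "Mon, 08 Dec 2014 08:21:09 GMT"}
print(conditional_headers(cached))
print(use_cached_copy(304), use_cached_copy(200))
```

The resulting dictionary could be passed as request headers to whatever HTTP client is in use.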

Note on varnish headers: All requests in varnish are assigned an XID number; the X-Varnish header tells you what it is and, if a cache hit was involved, also the XID of the transaction that put the object in the cache. The Age header tells how long a particular object has been in varnish's cache.

@dazza-codes dazza-codes self-assigned this Feb 19, 2015
@dazza-codes
Owner Author

There is additional administration metadata in the RDF response; e.g. consider the RDF body in a GET request for http://id.loc.gov/authorities/names/n79044798.rdf

However, the GET request incurs more work on the server to construct the body, on the network to deliver the packets, and on the client to parse the content. In addition, in this case, none of the information about the data is designed for caching it.

SKOS metadata:

<skos:changeNote xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <cs:ChangeSet xmlns:cs="http://purl.org/vocab/changeset/schema#">
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/names/n79044798"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1979-05-22T00:00:00</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</cs:changeReason>
    </cs:ChangeSet>
</skos:changeNote>
<skos:changeNote xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <cs:ChangeSet xmlns:cs="http://purl.org/vocab/changeset/schema#">
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/names/n79044798"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-12-08T08:21:09</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</cs:changeReason>
    </cs:ChangeSet>
</skos:changeNote>

MADS-RDF metadata:

<madsrdf:adminMetadata>
    <ri:RecordInfo xmlns:ri="http://id.loc.gov/ontologies/RecordInfo#">
        <ri:recordChangeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1979-05-22T00:00:00</ri:recordChangeDate>
        <ri:recordStatus rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</ri:recordStatus>
        <ri:recordContentSource rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <ri:languageOfCataloging rdf:resource="http://id.loc.gov/vocabulary/iso639-2/eng"/>
    </ri:RecordInfo>
</madsrdf:adminMetadata>
<madsrdf:adminMetadata>
    <ri:RecordInfo xmlns:ri="http://id.loc.gov/ontologies/RecordInfo#">
        <ri:recordChangeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-12-08T08:21:09</ri:recordChangeDate>
        <ri:recordStatus rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</ri:recordStatus>
        <ri:recordContentSource rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <ri:languageOfCataloging rdf:resource="http://id.loc.gov/vocabulary/iso639-2/eng"/>
    </ri:RecordInfo>
</madsrdf:adminMetadata>
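If a client did want to fall back on this metadata, the most recent change date could be pulled out roughly as follows. This is a sketch under assumptions: the `wrapper` element and its namespace declarations are added only so the trimmed fragment parses on its own, and `latest_change` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

RI = "http://id.loc.gov/ontologies/RecordInfo#"

# Trimmed copy of the RecordInfo elements above, wrapped so it is well-formed.
fragment = """
<wrapper xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ri="http://id.loc.gov/ontologies/RecordInfo#">
  <ri:RecordInfo>
    <ri:recordChangeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1979-05-22T00:00:00</ri:recordChangeDate>
  </ri:RecordInfo>
  <ri:RecordInfo>
    <ri:recordChangeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-12-08T08:21:09</ri:recordChangeDate>
  </ri:RecordInfo>
</wrapper>
"""

def latest_change(xml_text):
    """Return the most recent ri:recordChangeDate in the document, or None."""
    root = ET.fromstring(xml_text.strip())
    dates = [datetime.fromisoformat(el.text.strip())
             for el in root.iter("{%s}recordChangeDate" % RI)]
    return max(dates) if dates else None

print(latest_change(fragment))  # 2014-12-08 08:21:09
```

The dates carry no time zone, so any comparison against HTTP dates would need an assumption about which zone LOC intends.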

@dazza-codes
Owner Author

The headers on a VIAF resource appear to have no cache control, e.g. a HEAD or GET on
http://viaf.org/viaf/108317368/rdf

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Location: rdf.xml
Content-Type: text/xml
Content-Length: 6780
Date: Thu, 19 Feb 2015 16:46:28 GMT

Also, there is no RDF metadata about the date of creation or revision for the RDF resource at VIAF.

@dazza-codes
Owner Author

Notes on evaluating cache-control information.

First, this is relatively easy using Firefox with a plugin for HTTP resource testing, because it makes it easy to specify additional HTTP request types, like HEAD, and additional request headers. See https://addons.mozilla.org/en-us/firefox/addon/http-resource-test/

When this is installed in Firefox, it's available from the Tools > HTTP Resource Test menu. So, first paste in the example URI as 'http://id.loc.gov/authorities/subjects/sh85000399' in the top left box (URI) and then select the 'HEAD' request from the drop-down menu at the top right (Method). (Set no additional parameters in the 'Client Request' for the moment.) Then hit the submit button and the server response comes back in the 'Server Response' panels below, one for 'Headers' and one for the 'Body'.

Before reviewing the details of the response, let's be on the same page with regard to the meaning and purpose of a HEAD request. The HEAD request should retrieve only the response headers, without the body. The HEAD response for [http://id.loc.gov/authorities/subjects/sh85000399] does include a body; that MUST disappear. The spec is described here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

To quote that document (my italics and my emphasis in bold):

  • 9.4 HEAD
    • The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
    • The response to a HEAD request MAY be cacheable in the sense that the information contained in the response MAY be used to update a previously cached entity from that resource. If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale.

This is the HEAD header response from http://id.loc.gov/authorities/subjects/sh85000399 (I'm not going to review the HEAD body response; it should not exist):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: public, max-age=43200
Etag: 08918596b834994da64b993ff32d21c1
X-PrefLabel: Accordion music (Jazz)
X-URI: http://id.loc.gov/authorities/subjects/sh85000399
Server: Apache
Content-Length: 17973
Accept-Ranges: bytes
Date: Fri, 27 Feb 2015 17:27:03 GMT
X-Varnish: 814603380
Age: 0
Via: 1.1 varnish

I'll try to explain how I currently understand the useful cache-control values and what can be used and might be added. A quick caveat: my expertise in this area is limited, and an authoritative understanding should be gained from consulting the HTTP/1.1 standard (an IETF RFC, hosted in HTML form by W3C); see mainly section 13 and especially 14.9 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html. I'm using this document because it's somewhat easier to navigate and read, but there are more recent updates to this spec (RFC 2616 was replaced by RFCs 7230-7235), although they might not yet be implemented in server/client code. Anyhow, I'm assuming we are working with HTTP/1.1 systems; specific sections are noted below.

  • Cache-Control: public, max-age=43200
    • this means that intermediary proxy/cache systems in the public network infrastructure can cache this data and it can be considered good to cache for 43,200 seconds (or 12 hours). In some cases, a client might get this data from an intermediary proxy/cache that has cached the data (for up to 12 hours), rather than directly from id.loc.gov (unless the client overrides it with specific cache control request directives). The max-age on this value is likely reasonable and should not be used to represent information about any changes to the content.
  • Content-Length: 17973
    • this is the number of bytes in the body and could be used to cache data, but different characters could add up to the same content length, so it's not a unique identifier of the content to be cached.
    • 14.13 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html -- "The Content-Length entity-header field indicates the size of the entity-body, in decimal number of OCTETs, sent to the recipient or, in the case of the HEAD method, the size of the entity-body that would have been sent had the request been a GET."
  • Content-MD5: missing
    • maybe this could be used in addition to, or instead of, an Etag when the Etag has an MD5 hash value, only to be more specific about what kind of hash algorithm is used?
    • 14.15 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html -- "The Content-MD5 entity-header field, as defined in RFC 1864 ..., is an MD5 digest of the entity-body for the purpose of providing an end-to-end message integrity check (MIC) of the entity-body."
  • Etag: 08918596b834994da64b993ff32d21c1
    • this is currently implemented for id.loc.gov/authorities as an MD5 hash of "the entire document serialized"; it should be a unique identifier (the probability of clashes is determined by the hash algorithm and is usually very low); this should be a lot more reliable than the 'Content-Length' value; I'm not entirely clear about what exactly this MD5 hash was calculated on and it might be important, depending on how we interpret what the Etag value should be and how it should be used. An important question is whether the Etag value should be the same for an 'entity' (an authority record) that can be serialized in different digital streams (i.e. same Etag for .rdf, .json-ld, .nt, etc.)?
    • 13.3.3 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3.3 -- rules on how to determine if two entity tags match.
    • 14.19 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html -- "The ETag response-header field provides the current value of the entity tag for the requested variant. The headers used with entity tags are described in sections 14.24, 14.26 and 14.44. The entity tag MAY be used for comparison with other entities from the same resource (see section 13.3.3)." Note that many of the 14.24, 14.26, and 14.44 protocols can be modified by 'If-Modified-Since', which depends on 'Last-Modified' data (considered below).

So these related client cache control header requests have implications for how the server responds to new requests:

  • 14.24 'If-Match' at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html -- "The If-Match request-header field is used with a method to make it conditional. A client that has one or more entities previously obtained from the resource can verify that one of those entities is current by including a list of their associated entity tags in the If-Match header field. Entity tags are defined in section 3.11. The purpose of this feature is to allow efficient updates of cached information with a minimum amount of transaction overhead. It is also used, on updating requests, to prevent inadvertent modification of the wrong version of a resource."
  • 14.26 'If-None-Match' at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html -- "The If-None-Match request-header field is used with a method to make it conditional. A client that has one or more entities previously obtained from the resource can verify that none of those entities is current by including a list of their associated entity tags in the If-None-Match header field. The purpose of this feature is to allow efficient updates of cached information with a minimum amount of transaction overhead. It is also used to prevent a method (e.g. PUT) from inadvertently modifying an existing resource when the client believes that the resource does not exist."
  • Last-Modified: missing
    • please add it because it can modify Etag request/response behavior; see http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3
    • this is an interesting value and might be easier to implement on the server-side if this value is already stored in a database and/or the authority record data itself. A concern or consideration is whether this value (and the Etag value) should correspond to the last modified date of the 'abstract record information' (e.g. a MARC record) or the actual digital representation (serialization) of the information (which could vary with the serialization vocabulary, rdf format, human language, data encoding, encryption, etc.). If the serialization uses different vocabularies or conceptual entities or relations, even when the 'abstract record information' has not changed, this value might be the date of serializing the data in the revised structure. Anyhow, this value may not be as tightly coupled to the serialization of the information as the other header values like the 'Content-Length', 'Content-MD5' and 'Etag' (these latter values are more specific to the digital stream data). That is, if the 'Last-Modified' value is not tied to serializations, then it could be the same value across different serializations (rdf/xml, turtle, json-ld, ntriples, etc.). It might represent the last date the 'abstract record information' has changed.

From the client perspective, an interesting problem is how and why to use one or another of the cache headers. If the concern is caching the data at the byte level, the MD5 or other hashes are most useful, so perhaps the Etag is great for that. If the concern is the information, regardless of serialization format and byte representation, perhaps the 'Last-Modified' header is more interesting.

From the server side, the load on the system might be easier if the system is able to calculate a hash value for each serialization once and deliver it from a db/cache store for every HEAD request (without calculating it on the fly every time). If the server system calculates the 'Content-MD5' or 'Etag' for every response, that's going to show up in slow server performance (esp. CPU thrashing to calculate hashes). A good server stack should have this issue resolved already with an optimized cache system. With regard to 'Last-Modified', this could be even easier on the server side, if the value is derived from a db/cache date value (which is likely updated infrequently for most authority information).
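The compute-once idea can be sketched as below. This is an illustration only, not how id.loc.gov works: the dictionary stands in for a db/cache store, and `etag_for` and `invalidate` are hypothetical helpers. Each serialization has its own URI, so each gets its own stored hash.

```python
import hashlib

_etag_store = {}  # hypothetical stand-in for a db/cache store

def etag_for(uri, body):
    """Compute the MD5 digest once per URI and reuse the stored value afterwards."""
    if uri not in _etag_store:
        _etag_store[uri] = hashlib.md5(body).hexdigest()
    return _etag_store[uri]

def invalidate(uri):
    """Drop the stored hash when the record behind `uri` is updated."""
    _etag_store.pop(uri, None)

rdf_uri = "http://id.loc.gov/authorities/subjects/sh85000399.rdf"
first = etag_for(rdf_uri, b"<rdf:RDF>...</rdf:RDF>")
again = etag_for(rdf_uri, b"ignored until invalidated")
print(first == again)  # True
```

The important design point is that `invalidate` must be called by whatever process updates the record; otherwise the stored hash goes stale.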

If a 'Last-Modified' value is available, it could be very useful to have documentation on exactly what it means and how it might change. That explanation might correspond to policy decisions and practices for updating content. From a systems perspective, it might also provide an opportunity to index the values and provide a periodic 'updated' API that is consistent with 'If-Modified-Since' requests. If that exists, then it could be used to quickly identify only the records that have recently changed (or changed since a given date parameter). The idea here is that an 'updated' API might be more efficient than individual HEAD/GET requests on every resource (URI) to check all the 'If-Modified-Since' responses. (The context is some form of regular update that is not an entire DB bulk update and not an individual resource update, but something like an update to a subset/collection of resources.)
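A rough sketch of what such an 'updated' lookup might do on the server side follows. No such API is known to exist; the index contents, the `updated_since` name, and the second record's date are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical index of URI -> last-modified time (the second date is invented).
last_modified_index = {
    "http://id.loc.gov/authorities/names/n79044798": datetime(2014, 12, 8, 8, 21, 9),
    "http://id.loc.gov/authorities/subjects/sh85000399": datetime(2013, 1, 1, 0, 0, 0),
}

def updated_since(index, since):
    """Return the URIs whose last-modified time is later than `since`, sorted."""
    return sorted(uri for uri, modified in index.items() if modified > since)

for uri in updated_since(last_modified_index, datetime(2014, 1, 1)):
    print(uri)
```

A client holding many cached records could then make one call per update cycle and refresh only the URIs returned.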

A consideration in this regard is explicit control over the 'Expires' header for records that have changed (14.21 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). Similarly, the update issue also involves the 'If-Modified-Since' client header in a HEAD or GET request. The server can simply respond with a 304 (not modified) code; see 14.25 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html -- "The If-Modified-Since request-header field is used with a method to make it conditional: if the requested variant has not been modified since the time specified in this field, an entity will not be returned from the server; instead, a 304 (not modified) response will be returned without any message-body. ... To get best results when sending an If-Modified-Since header field for cache validation, clients are advised to use the exact date string received in a previous Last-Modified header field whenever possible." So, this indicates the importance of the 'Last-Modified' date.
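The server-side decision quoted from 14.25 can be sketched as follows. It is a simplification that covers only the date comparison: `status_for` is a hypothetical helper, and the Last-Modified value is an assumed GMT rendering of the 2014-12-08 revision date, since LOC does not currently send that header.

```python
from email.utils import parsedate_to_datetime

def status_for(if_modified_since, last_modified):
    """Return 304 if the resource is unchanged since the client's date, else 200."""
    if if_modified_since is None:
        return 200  # unconditional request
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # an unparseable date is ignored
    return 304 if parsedate_to_datetime(last_modified) <= since else 200

last_modified = "Mon, 08 Dec 2014 08:21:09 GMT"  # assumed value
print(status_for("Mon, 08 Dec 2014 08:21:09 GMT", last_modified))  # 304
print(status_for("Sun, 07 Dec 2014 00:00:00 GMT", last_modified))  # 200
```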

Bear in mind that we are on the verge of HTTP/2, so that will be interesting too! e.g.
http://http2.github.io/faq/

@dazza-codes
Owner Author

Notes on using curl, e.g.

$ curl -I http://id.loc.gov/authorities/subjects/sh85000399
HTTP/1.1 303 SEE OTHER
Location: http://id.loc.gov/authorities/subjects/sh85000399.html
Vary: Accept
X-URI: http://id.loc.gov/authorities/subjects/sh85000399
X-PrefLabel: Accordion music (Jazz)
Server: Apache
Content-Length: 0
Accept-Ranges: bytes
Date: Sun, 08 Mar 2015 15:40:45 GMT
X-Varnish: 817484216
Age: 0
Via: 1.1 varnish
Connection: keep-alive

curl will not automatically follow a redirect (303), unless the -L option is used. So it can be helpful to see the details of redirections. For this example, it redirects to an HTML page, but we want the RDF, e.g.

$ curl -I http://id.loc.gov/authorities/subjects/sh85000399.rdf
HTTP/1.1 200 OK
Content-type: application/rdf+xml
Cache-Control: public, max-age=43200
ETag: a906603072f5c988349b027364a6ef43
X-URI: http://id.loc.gov/authorities/subjects/sh85000399
Server: Apache
Content-Length: 5411
Accept-Ranges: bytes
Date: Sun, 08 Mar 2015 15:43:13 GMT
X-Varnish: 817485163
Age: 0
Via: 1.1 varnish
Connection: keep-alive

@dazza-codes
Owner Author

At the client side, this might be partially solved by using https://github.com/crohr/rest-client-components; however, longer-term caching may require recording cache-control data in RDF provenance details of some kind.
