Use cache expiry #1
For example, consider the HEAD request on an LOC authority such as http://id.loc.gov/authorities/subjects/sh85000399 (the example URI used later in this thread).
The useful data for caching here is the max-age value. In theory, the max-age should correspond to the frequency with which LOC updates this data on their servers. In practice, they don't want to manage these details, so the policy is to have the data cached for only 12 hours, so that whenever it is updated, the caches will be refreshed within a day. It may be preferable for LOC to set the … Another option: if LOC sets the … Another option: LOC could set the …

Note on Varnish headers: all requests in Varnish are assigned an XID number; the X-Varnish header tells you what it is and, if a cache hit was involved, also the XID of the transaction that put the object in the cache. The Age header tells how long a particular object has been in Varnish's cache.
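A rough sketch of reading these headers programmatically, assuming Python and the `requests` library (the URI is the example LOC authority discussed later in this thread):

```python
import requests

# Illustrative sketch: issue a HEAD request and estimate remaining freshness
# from the Cache-Control max-age directive and the Varnish Age header.
uri = "http://id.loc.gov/authorities/subjects/sh85000399"
resp = requests.head(uri)

cache_control = resp.headers.get("Cache-Control", "")
age = int(resp.headers.get("Age", "0"))

max_age = 0
for directive in cache_control.split(","):
    directive = directive.strip()
    if directive.startswith("max-age="):
        max_age = int(directive.split("=", 1)[1])

print("X-Varnish:", resp.headers.get("X-Varnish"))
print("Age:", age, "seconds in cache")
print("Freshness remaining:", max_age - age, "seconds")
```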
There is additional administrative metadata in the RDF response; e.g. consider the RDF body in a GET request for http://id.loc.gov/authorities/names/n79044798.rdf. However, the GET request incurs more work on the server to construct the body, on the network to deliver the packets, and on the client to parse the content. In addition, in this case, there is no information about the data that is designed for caching it.

SKOS metadata: …

MADS-RDF metadata: …
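As a side note, a minimal sketch of the "more work on the client to parse the content" point, assuming Python and rdflib; the date filter is only a heuristic for surfacing whatever administrative metadata the RDF happens to carry:

```python
from rdflib import Graph

# Parse the RDF/XML body and print any statements whose predicate looks
# date-related, to see what revision metadata the provider embeds.
uri = "http://id.loc.gov/authorities/names/n79044798.rdf"
g = Graph()
g.parse(uri, format="xml")

for s, p, o in g:
    if "date" in str(p).lower():
        print(s, p, o)
```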
The headers on a VIAF resource appear to have no cache control, e.g. a HEAD or GET on … Also, there is no RDF metadata about the date of creation or revision for the RDF resource at VIAF.
Notes on evaluating cache-control information. First, this is relatively easy using Firefox with a plugin for HTTP resource testing, because it gives easy access to additional HTTP request types, like HEAD, and additional request headers; see https://addons.mozilla.org/en-us/firefox/addon/http-resource-test/. When this is installed in Firefox, it's available from the Tools > HTTP Resource Test menu.

So, first paste the example URI 'http://id.loc.gov/authorities/subjects/sh85000399' into the top-left box (URI) and then select the 'HEAD' request from the drop-down menu at the top right (Method). (Set no additional parameters in the 'Client Request' for the moment.) Then hit the submit button and the server response comes back in the 'Server Response' panels below, one for 'Headers' and one for the 'Body'.

Before reviewing the details of the response, let's be on the same page with regard to the meaning and purpose of a HEAD request. The HEAD request should retrieve only the response headers, without the body. The HEAD response for http://id.loc.gov/authorities/subjects/sh85000399 does include a body; that MUST disappear. The spec is described here: … To quote that document (my italics and my emphasis in bold): …
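For anyone not using Firefox, a roughly equivalent check can be scripted; this is a sketch assuming Python and the `requests` library, and it also verifies the point above that a HEAD response should carry no body:

```python
import requests

# Issue a HEAD request, dump the response headers, and confirm that no body
# came back (a HEAD response must not include a message body).
uri = "http://id.loc.gov/authorities/subjects/sh85000399"
resp = requests.head(uri, allow_redirects=True)

for name, value in resp.headers.items():
    print(f"{name}: {value}")

print("body bytes returned:", len(resp.content))  # expected to be 0
```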
This is the HEAD header response from http://id.loc.gov/authorities/subjects/sh85000399 (I'm not going to review the HEAD body response; it should not exist):

HTTP/1.1 200 OK
…

I'll try to explain how I currently understand the useful cache-control values, what can be used, and what might be added. A quick caveat: my expertise in this area is limited, and an authoritative understanding should be gained from consulting the standards, mainly section 13 and especially 14.9 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html. I'm using this document because it's somewhat easier to navigate and read, but there are more recent updates to this spec (RFC 2616 was replaced by RFCs 7230-7237), although they might not yet be implemented in server/client code. In any case, I'm assuming we are working with HTTP/1.1 systems; specific sections are noted below.
So these related client cache-control request headers have implications for how the server responds to new requests: …
From the client perspective, an interesting problem is how and why to use one or another of the cache headers. If the concern is caching the data at the byte level, MD5 or other hashes are most useful, so perhaps the ETag is great for that. If the concern is the information, regardless of serialization format and byte representation, perhaps the 'Last-Modified' header is more interesting.

From the server side, the load on the system might be easier if the system is able to calculate a hash value for each serialization once and deliver it from a db/cache store for every HEAD request (without calculating it on the fly every time). If the server system calculates the 'Content-MD5' or 'ETag' for every response, that's going to show up in slow server performance (esp. CPU thrashing to calculate hashes). A good server stack should have this issue resolved already with an optimized cache system. With regard to 'Last-Modified', this could be even easier on the server side, if the value is derived from a db/cache date value (which is likely updated infrequently for most authority information).

If a 'Last-Modified' value is available, it could be very useful to have documentation on exactly what it means and how it might change. That explanation might correspond to policy decisions and practices for updating content. From a systems perspective, it might also provide an opportunity to index the values and provide a periodic 'updated' API that is consistent with 'If-Modified-Since' requests. If that exists, then it could be used to quickly identify only the records that have recently changed (or changed since a given date parameter). The idea here is that an 'updated' API might be more efficient than individual HEAD/GET requests on every resource (URI) to check all the 'Last-Modified' responses. (The context is some form of regular update that is not an entire DB bulk update and not an individual resource update, but something like an update to a subset/collection of resources.) A consideration in this regard is explicit control over the 'Expires' header for records that have changed (14.21 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html).

Similarly, the update issue also involves the 'If-Modified-Since' client header in a HEAD or GET request. The server can simply respond with a 304 (Not Modified) code; see 14.25 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "The If-Modified-Since request-header field is used with a method to make it conditional: if the requested variant has not been modified since the time specified in this field, an entity will not be returned from the server; instead, a 304 (not modified) response will be returned without any message-body. ... To get best results when sending an If-Modified-Since header field for cache validation, clients are advised to use the exact date string received in a previous Last-Modified header field whenever possible." So this indicates the importance of the 'Last-Modified' date.

Bear in mind that we are on the verge of HTTP/2, so that will be interesting too! e.g. …
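A small sketch of the conditional-request pattern just described, assuming Python and the `requests` library; it reuses the exact Last-Modified string from a prior response, as the RFC advises:

```python
import requests

uri = "http://id.loc.gov/authorities/subjects/sh85000399"

# First request: remember the validator the server sent back.
first = requests.get(uri)
last_modified = first.headers.get("Last-Modified")

# Later request: make it conditional. A 304 means the cached copy is still
# valid and no message body is transferred.
headers = {"If-Modified-Since": last_modified} if last_modified else {}
second = requests.get(uri, headers=headers)

if second.status_code == 304:
    print("Not modified; keep the cached copy")
else:
    print("Modified (or no validator available); refresh the cache")
```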
Notes on using …
On the client side, this might be partially solved by using https://github.com/crohr/rest-client-components; however, longer-term caching may require noting cache-control data in RDF provenance details of some kind.
When storing retrieved RDF in a local cache, try to add a triple that represents any cache expiry headers from the original source of the RDF (LOC, VIAF, ISNI, OCLC, etc.). Create a cron job process that can query the local cache to extract entities with cache-control data that has expired, then use a background queue that runs processes with a 'nice' priority to retrieve and update the expired RDF.
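A sketch of the expiry triple, assuming Python with rdflib; the `ex:cacheExpires` predicate is a made-up placeholder, not an agreed vocabulary term:

```python
import email.utils
import time

from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespace/predicate, for illustration only.
EX = Namespace("http://example.org/cache#")

def record_expiry(graph, resource_uri, response_headers):
    """Derive an expiry time from the source's Cache-Control header and store it."""
    max_age = 0
    for directive in response_headers.get("Cache-Control", "").split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
    expires_at = time.time() + max_age
    graph.add((URIRef(resource_uri),
               EX.cacheExpires,
               Literal(email.utils.formatdate(expires_at, usegmt=True))))

g = Graph()
record_expiry(g, "http://id.loc.gov/authorities/subjects/sh85000399",
              {"Cache-Control": "public, max-age=43200"})
print(g.serialize(format="turtle"))
```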
If mongodb is the local cache, create an index on the cache-control data. Where possible, encourage RDF providers to use cache-control headers with reasonable values that correspond to their routine update cycles (if they are routine). If an entire repository has the same expiry, it might be readily stored in only a few triples.
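If mongodb is used, the expired-entity query can be kept cheap with an index; a sketch assuming pymongo and a hypothetical `cache_expires_at` field on each cached document:

```python
import datetime

from pymongo import ASCENDING, MongoClient

client = MongoClient()
cache = client["linked_data_cache"]["resources"]  # hypothetical database/collection names

# Index the expiry field so the cron/queue job can find stale entries without a scan.
cache.create_index([("cache_expires_at", ASCENDING)])

# Entities whose cached copy has expired and should be re-fetched by the background queue.
stale = cache.find({"cache_expires_at": {"$lt": datetime.datetime.utcnow()}})
for doc in stale:
    print(doc["_id"])
```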