Add cache management features (#799)
* Add cache management features.

We add a cache management layer on top of Pystow. This takes the form of
two classes (both in `oaklib.utilities.caching`):

* one representing the cache management policy, i.e. the logic dictating
  whether a cached file (if present) should be refreshed or not;
* one representing the file cache itself.

The policy is set once by the main entry point method, using either a
default policy of refreshing cached data after 7 days, or another policy
explicitly selected by the user with the new `--caching` option.

The class that represents the file cache is the one that the rest of OAK
should interact with whenever access to cached data is needed.
Ultimately, all calls to the Pystow module should be replaced by calls
to FileCache, the use of Pystow becoming an implementation detail
entirely encapsulated in FileCache.

* Re-implement cache-ls and cache-clear.

Add new methods to the FileCache class to (1) get the list of files
present in the cache and (2) delete files in the cache.

Replace the implementations of the cache-ls and cache-clear commands to
use the new methods, so that the details of cache listing and clearing
remain encapsulated in FileCache.

As a side-effect, this automatically fixes the issue that cache listing
was only working on Unix-like systems, since the FileCache
implementation is pure Python and does not rely on the ls(1) Unix
command.
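
To illustrate why a pure-Python listing is portable, here is a minimal sketch of how cache contents could be enumerated with `pathlib` alone. The function name `list_cache` and its return shape are assumptions made for illustration; they are not the actual `FileCache.get_contents` implementation.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterator, Tuple


def list_cache(directory: Path, subdirs: bool = False) -> Iterator[Tuple[str, int, datetime]]:
    """Yield (relative path, size in bytes, modification time) for each file.

    Hypothetical helper: it only uses pathlib, so it does not depend on
    an external command such as ls(1).
    """
    # rglob descends into subdirectories, glob stays at the top level.
    walker = directory.rglob("*") if subdirs else directory.glob("*")
    for path in sorted(walker):
        if path.is_file():
            info = path.stat()
            name = path.relative_to(directory).as_posix()
            yield name, info.st_size, datetime.fromtimestamp(info.st_mtime)
```

A caller can iterate over the result and format each entry as it sees fit, which is roughly what a listing command needs.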

* Implement the cache reset policy.

The intended difference between the REFRESH and RESET caching policies
is that, when a cache lookup is attempted, REFRESH should cause the file
that was looked up -- and only that file -- to be refreshed, leaving any
other file that may be present in the cache untouched. RESET, on the
other hand, should entirely clear the cache, so that not only the file
that was looked up should be refreshed, but any other file that may be
looked up in a subsequent call should be refreshed as well.

This commit implements the intended behaviour for the RESET policy.
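
The difference between the two policies can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: the `lookup` function, its string-valued `policy` argument and the `download` callback are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def lookup(cache_dir: Path, name: str, policy: str,
           download: Callable[[Path], None]) -> Path:
    """Return the cached file, honouring a 'refresh' or 'reset' policy."""
    target = cache_dir / name
    if policy == "reset":
        # RESET: drop every cached file, so later lookups of other
        # files will have to fetch them again as well.
        for path in cache_dir.glob("*"):
            if path.is_file():
                path.unlink()
    elif policy == "refresh":
        # REFRESH: drop only the file being looked up.
        target.unlink(missing_ok=True)
    if not target.exists():
        download(target)
    return target
```

In both cases nothing happens until a lookup is actually attempted, which matches the behaviour described above.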

* Fix forced refresh for future timestamps and add tests.

In principle, we should never have to compare a timestamp representing a
future date when we check whether a cached file should be refreshed.
However, files with bogus mtime values and/or computers configured with
a bogus system time are certainly not uncommon, so encountering a
timestamp higher than the current time can (and will) definitely happen.

Under an "always refresh" policy, a refresh must be triggered even if
the cached file appears to be "newer than now", so we explicitly implement
that behaviour here.
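
A minimal sketch of the problem, assuming a hypothetical `needs_refresh` helper (this is not the actual CachePolicy code): a plain age comparison yields a negative age for a future timestamp and would never trigger a refresh, so the "always refresh" case has to be checked before any age arithmetic.

```python
import time
from typing import Optional


def needs_refresh(mtime: float, max_age: Optional[float],
                  always: bool = False, now: Optional[float] = None) -> bool:
    """Decide whether a cached file with the given mtime must be refreshed.

    max_age is a lifetime in seconds; None means "never refresh".
    """
    if always:
        # Checked first so that a bogus future mtime cannot defeat it.
        return True
    if max_age is None:
        return False
    now = time.time() if now is None else now
    # For a future mtime the age is negative, so this is False.
    return now - mtime > max_age
```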

We also add a complete test fixture for the CachePolicy class.

* Add some documentation for --caching.

In the SQLite tutorial, in the section that briefly mentions that
automatically downloaded SQLite files are cached in ``.data/oaklib``, we
describe in more detail how the cache works and how it can be
controlled using the `--caching` option.

* Add complete documentation for the `--caching` option.

Add a new section in the CLI reference documentation to explain how the
cache works and how it can be controlled using the `--caching` option.

Replace the previous, shorter documentation in the SQLite tutorial with a
simple mention of the cache with a link to the newly added reference
section.

* Allow controlling the cache through a configuration file.

This commit adds the possibility to configure the file cache to apply
pattern-specific caching policies. This is controlled by a configuration
file ($XDG_CONFIG_HOME/ontology-access-kit/cache.conf, under GNU/Linux)
containing "pattern=policy" pairs, where pattern is a shell-type
globbing pattern and policy is a string of the same type as expected by
the newly introduced --caching option.

* Misc documentation fix.

The "user_config_dir" returned by the Appdirs package under macOS is not
in "~/Library/Preferences" but under "~/Library/Application Support"
(Appdirs documentation is not up to date).

Also, there is no need to mention the roaming directory under Windows,
as Appdirs will never use that directory unless we explicitly ask it to do
so (which we don't).

There is also no need for a show_default=True parameter with the
--caching option, since that option has _no_ default.
gouttegd authored Aug 22, 2024
1 parent c93a9dc commit ecfa132
Showing 9 changed files with 681 additions and 23 deletions.
102 changes: 102 additions & 0 deletions docs/cli.rst
@@ -91,6 +91,108 @@ and tracing upwards through is_a and part_of relationships:
uberon viz -p i,p hand foot
Cache Control
-------------

OAK may download data from remote sources as part of its normal operations. For
example, using the :code:`sqlite:obo:...` input selector will cause OAK to
fetch the requested Semantic-SQL database from a centralised repository.
Whenever that happens, the downloaded data will be cached in a local directory
so that subsequent commands using the same input selector do not have to
download the file again.

By default, OAK will refresh (download again) a previously downloaded file if
it was last downloaded more than 7 days ago.

The behavior of the cache can be controlled in two ways: with an option on the
command line and with a configuration file.

Controlling the cache on the command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The global option :code:`--caching` gives the user some control over how the
cache works.

To change the default cache expiry lifetime of 7 days, the :code:`--caching`
option accepts a value of the form :code:`ND`, where *N* is a positive integer
and *D* can be either :code:`s`, :code:`d`, :code:`w`, :code:`m`, or :code:`y`
to indicate that *N* is a number of seconds, days, weeks, months, or years,
respectively. If the *D* part is omitted, it defaults to :code:`d`.

For example, :code:`--caching=3w` instructs OAK to refresh a cached file if it
was last refreshed more than 21 days ago.
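
As an illustration of the :code:`ND` syntax, the following sketch parses such a value into a ``timedelta``. It is not OAK's actual parser: the function name ``parse_lifetime`` is hypothetical, and the lengths used for a month (30 days) and a year (365 days) are assumptions, since the text above does not specify them.

```python
import re
from datetime import timedelta

# Assumed day counts per unit; months and years are approximations.
_UNIT_DAYS = {"d": 1, "w": 7, "m": 30, "y": 365}


def parse_lifetime(value: str) -> timedelta:
    """Parse a value of the form ND (e.g. '3w', '90s', '7') into a timedelta."""
    match = re.fullmatch(r"(\d+)([sdwmy]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"invalid cache lifetime: {value!r}")
    number = int(match.group(1))
    if number <= 0:
        raise ValueError("the lifetime must be a positive integer")
    unit = match.group(2) or "d"  # a missing unit defaults to days
    if unit == "s":
        return timedelta(seconds=number)
    return timedelta(days=number * _UNIT_DAYS[unit])
```

With this sketch, ``parse_lifetime("3w")`` and ``parse_lifetime("21")`` both yield a lifetime of 21 days.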

The :code:`--caching` option also accepts the following special values:

- :code:`refresh` to force OAK to always refresh a file regardless of its age;
- :code:`no-refresh` to do the opposite, that is, to prevent OAK from
  refreshing a file regardless of its age;
- :code:`clear` to forcefully clear the cache (which will trigger a refresh as
  a consequence);
- :code:`reset`, which is a synonym of :code:`clear`.

Note the difference between :code:`refresh` and :code:`clear`. The former will
only cause the requested file to be refreshed, leaving any other file that may
exist in the cache untouched. The latter will delete all cached files, so that
not only the requested file will be downloaded again, but any other
previously cached file will also have to be downloaded again the next time it
is requested.

In both cases, refreshing and clearing will only happen if the OAK command in
which the :code:`--caching` option is used attempts to look up a cached file.
Otherwise the option will have no effect.

To forcefully clear the cache independently of any command, the
:ref:`cache-clear` command may be used. The contents of the cache may be
explored at any time with the :ref:`cache-ls` command.

Controlling the cache with a configuration file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finer control of how the cache works is possible through a configuration file
that OAK will look for at the following locations:

- under GNU/Linux: in ``$XDG_CONFIG_HOME/ontology-access-kit/cache.conf``;
- under macOS: in ``$HOME/Library/Application Support/ontology-access-kit/cache.conf``;
- under Windows: in ``%LOCALAPPDATA%\ontology-access-kit\ontology-access-kit\cache.conf``.

The file should contain lines of the form :code:`pattern = policy`, where:

- *pattern* is a shell-type globbing pattern indicating the files to which the
  policy set forth on the line applies;
- *policy* is the same type of value as expected by the :code:`--caching`
  option, as explained in the previous section.

Blank lines and lines starting with :code:`#` are ignored.

If the *pattern* is :code:`default` (or :code:`*`), the corresponding policy
will be used for any cached file that does not have a matching policy.

Here is a sample configuration file:

.. code-block::

    # Uberon will be refreshed if older than 1 month
    uberon.db = 1m
    # FBbt will be refreshed if older than 2 weeks
    fbbt.db = 2w
    # Other FlyBase ontologies will be refreshed if older than 2 months
    fb*.db = 2m
    # All other files will be refreshed if older than 3 weeks
    default = 3w

Note that when looking up the policy to apply to a given file, patterns are
tried in the order they appear in the file. This is why the :code:`fbbt.db`
pattern in the example above must be listed *before* the less specific
:code:`fb*.db` pattern, otherwise it would be ignored. (This does not apply to
the default pattern -- whether it is specified as :code:`default` or as
:code:`*` -- which is always tried after all the other patterns.)

The :code:`--caching` option described in the previous section always takes
precedence over the configuration file. That is, all rules set forth in the
configuration will be ignored if the :code:`--caching` option is specified on
the command line.

Commands
-----------

4 changes: 4 additions & 0 deletions docs/intro/tutorial07.rst
@@ -64,6 +64,10 @@ This will download the pato.db sqlite file once, and cache it.

PyStow is used to cache the file, and the default location is ``~/.data/oaklib``.

By default, a cached SQLite file will be automatically refreshed (downloaded
again) if it is older than 7 days. For details on how to alter the behavior of
the cache, see the :ref:`Cache Control` section in the CLI documentation.

Building your own SQLite files
-------------------

38 changes: 19 additions & 19 deletions src/oaklib/cli.py
@@ -9,14 +9,12 @@
# See https://stackoverflow.com/questions/47972638/how-can-i-define-the-order-of-click-sub-commands-in-help
import json
import logging
import os
import statistics as stats
import sys
from collections import defaultdict
from enum import Enum, unique
from itertools import chain
from pathlib import Path
from time import time
from types import ModuleType
from typing import (
Any,
@@ -28,7 +26,6 @@

import click
import kgcl_schema.grammar.parser as kgcl_parser
import pystow
import sssom.writers as sssom_writers
import sssom_schema
import yaml
@@ -42,6 +39,7 @@

import oaklib.datamodels.taxon_constraints as tcdm
from oaklib import datamodels
from oaklib.constants import FILE_CACHE
from oaklib.converters.logical_definition_flattener import LogicalDefinitionFlattener
from oaklib.datamodels import synonymizer_datamodel
from oaklib.datamodels.association import RollupGroup
@@ -149,6 +147,7 @@
generate_disjoint_class_expressions_axioms,
)
from oaklib.utilities.basic_utils import pairs_as_dict
from oaklib.utilities.caching import CachePolicy
from oaklib.utilities.iterator_utils import chunk
from oaklib.utilities.kgcl_utilities import (
generate_change_id,
@@ -568,6 +567,11 @@ def _apply_changes(impl, changes: List[kgcl.Change]):
show_default=True,
help="If set, will profile the command",
)
@click.option(
"--caching",
type=CachePolicy.ClickType,
help="Set the cache management policy",
)
def main(
verbose: int,
quiet: bool,
@@ -587,6 +591,7 @@ def main(
prefix,
profile: bool,
import_depth: Optional[int],
caching: Optional[CachePolicy],
**kwargs,
):
"""
@@ -635,6 +640,7 @@ def exit():
import requests_cache

requests_cache.install_cache(requests_cache_db)
FILE_CACHE.force_policy(caching)
resource = OntologyResource()
resource.slug = input
settings.autosave = autosave
@@ -5454,12 +5460,14 @@ def cache_ls():
"""
List the contents of the pystow oaklib cache.
TODO: this currently only works on unix-based systems.
"""
directory = pystow.api.join("oaklib")
command = f"ls -al {directory}"
click.secho(f"[pystow] {command}", fg="cyan", bold=True)
os.system(command) # noqa:S605
units = ["B", "KB", "MB", "GB", "TB"]
for path, size, mtime in FILE_CACHE.get_contents(subdirs=True):
i = 0
while size > 1024 and i < len(units) - 1:
size /= 1024
i += 1
click.echo(f"{path} ({size:.2f} {units[i]}, {mtime:%Y-%m-%d})")


@main.command()
@@ -5475,17 +5483,9 @@ def cache_clear(days_old: int):
Clear the contents of the pystow oaklib cache.
"""
directory = pystow.api.join("oaklib")
now = time()
for item in Path(directory).glob("*"):
if ".db" not in str(item):
continue
mtime = item.stat().st_mtime
curr_days_old = (int(now) - int(mtime)) / 86400
logging.info(f"{item} is {curr_days_old}")
if curr_days_old > days_old:
click.echo(f"Deleting {item} which is {curr_days_old}")
item.unlink()

for name, _, age in FILE_CACHE.clear(subdirs=False, older_than=days_old, pattern="*.db*"):
click.echo(f"Deleted {name} which was {age.days} days old")


@main.command()
4 changes: 4 additions & 0 deletions src/oaklib/constants.py
@@ -2,9 +2,13 @@

import pystow

from oaklib.utilities.caching import FileCache

__all__ = [
"OAKLIB_MODULE",
"FILE_CACHE",
]

OAKLIB_MODULE = pystow.module("oaklib")
FILE_CACHE = FileCache(OAKLIB_MODULE)
TIMEOUT_SECONDS = 30
4 changes: 2 additions & 2 deletions src/oaklib/implementations/llm_implementation.py
@@ -8,7 +8,6 @@
from dataclasses import dataclass
from typing import TYPE_CHECKING, Dict, Iterable, Iterator, List, Optional, Tuple

import pystow
from linkml_runtime.dumpers import yaml_dumper
from sssom_schema import Mapping
from tenacity import (
Expand All @@ -19,6 +18,7 @@
)

from oaklib import BasicOntologyInterface
from oaklib.constants import FILE_CACHE
from oaklib.datamodels.class_enrichment import ClassEnrichmentResult
from oaklib.datamodels.item_list import ItemList
from oaklib.datamodels.obograph import DefinitionPropertyValue
@@ -148,7 +148,7 @@ def config_to_prompt(configuration: Optional[ValidationConfiguration]) -> Option

for obj in configuration.documentation_objects:
if obj.startswith("http:") or obj.startswith("https:"):
path = pystow.ensure("oaklib", "documents", url=obj)
path = FILE_CACHE.ensure("documents", url=obj)
else:
path = obj
with open(path) as f:
4 changes: 2 additions & 2 deletions src/oaklib/implementations/sqldb/sql_implementation.py
@@ -63,7 +63,7 @@

import oaklib.datamodels.ontology_metadata as om
import oaklib.datamodels.validation_datamodel as vdm
from oaklib.constants import OAKLIB_MODULE
from oaklib.constants import FILE_CACHE
from oaklib.datamodels import obograph, ontology_metadata
from oaklib.datamodels.association import Association
from oaklib.datamodels.obograph import (
@@ -342,7 +342,7 @@ def __post_init__(self):
# Option 1 uses direct URL construction:
url = f"https://s3.amazonaws.com/bbop-sqlite/{prefix}.db.gz"
logging.info(f"Ensuring gunzipped for {url}")
db_path = OAKLIB_MODULE.ensure_gunzip(url=url, autoclean=False)
db_path = FILE_CACHE.ensure_gunzip(url=url, autoclean=False)
# Option 2 uses botocore to interface with the S3 API directly:
# db_path = OAKLIB_MODULE.ensure_from_s3(s3_bucket="bbop-sqlite", s3_key=f"{prefix}.db")
locator = f"sqlite:///{db_path}"