ubelt.util_cache module

This module exposes Cacher and CacheStamp classes, which provide a simple API for on-disk caching.

The Cacher class is the simplest and most direct method of caching. In fact, it only requires four lines of boilerplate, which is the smallest general and robust way that I (Jon Crall) have achieved, and I don’t think it’s possible to do better. These four lines implement the following necessary and sufficient steps for general robust on-disk caching.

  1. Defining the cache dependencies

  2. Checking if the cache missed

  3. Loading the cache on a hit

  4. Executing the process and saving the result on a miss.

The following example illustrates these four points.

Example

>>> import ubelt as ub
>>> # Define a cache name and dependencies (which is fed to `ub.hash_data`)
>>> cacher = ub.Cacher('name', depends='set-of-deps')  # boilerplate:1
>>> # Calling tryload will return your data on a hit and None on a miss
>>> data = cacher.tryload(on_error='clear')            # boilerplate:2
>>> # Check if you need to recompute your data
>>> if data is None:                                   # boilerplate:3
>>>     # Your code to recompute data goes here (this is not boilerplate).
>>>     data = 'mydata'
>>>     # Cache the computation result (via pickle)
>>>     cacher.save(data)                              # boilerplate:4

Surprisingly, this uses just as many boilerplate lines as a decorator-style cacher, but it is much more extensible. It is possible to use Cacher in more sophisticated ways (e.g. metadata), but the simple in-line use is often easier and cleaner. The following example shows the decorator style followed by the in-line style:

Example

>>> import ubelt as ub
>>> @ub.Cacher('name', depends={'dep1': 1, 'dep2': 2})  # boilerplate:1
>>> def func():                                         # boilerplate:2
>>>     data = 'mydata'
>>>     return data                                     # boilerplate:3
>>> data = func()                                       # boilerplate:4
>>> # The equivalent in-line version
>>> cacher = ub.Cacher('name', depends=['dependencies'])  # boilerplate:1
>>> data = cacher.tryload(on_error='clear')               # boilerplate:2
>>> if data is None:                                      # boilerplate:3
>>>     data = 'mydata'
>>>     cacher.save(data)                                 # boilerplate:4

While the above two are equivalent, the second version provides simpler tracebacks, explicit procedures, and makes it easier to use breakpoint debugging (because there is no closure scope).

While Cacher is used to store direct results of in-line code in a pickle format, the CacheStamp object is used to cache processes that produce on-disk side effects other than the main return value. For instance, consider the following example:

Example

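>>> def compute_many_files(dpath):
...     for i in range(3):
...         fpath = '{}/file{}.txt'.format(dpath, i)
...         # Open in write mode; the default read mode cannot write
...         with open(fpath, 'w') as file:
...             file.write('foo' + str(i))
>>> #
>>> import ubelt as ub
>>> dpath = ub.ensure_app_cache_dir('ubelt/demo/cache')
>>> ub.delete(dpath)  # start fresh
>>> ub.ensuredir(dpath)  # recreate the directory before writing into it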
>>> # You must specify a directory, unlike in Cacher where it is optional
>>> self = ub.CacheStamp('name', dpath=dpath, depends={'a': 1, 'b': 2})
>>> if self.expired():
>>>     compute_many_files(dpath)
>>>     # Instead of caching the whole process, we just write a file
>>>     # that signals the process has been done.
>>>     self.renew()
>>> assert not self.expired()

Todo

  • [ ] Remove the cfgstr-overrides?

class ubelt.util_cache.Cacher(fname, depends=None, dpath=None, appname='ubelt', ext='.pkl', meta=None, verbose=None, enabled=True, log=None, hasher='sha1', protocol=-1, cfgstr=None)

Bases: object

Cacher designed to be quickly integrated into existing scripts.

A dependency string can be specified, which will invalidate the cache if it changes to an unseen value. The location of the cache is controlled by dpath, or by appname when dpath is not given.

Parameters
  • fname (str) – A file name. This is the prefix that will be used by the cache. It will always be used as-is.

  • depends (str | List[str] | None) – Indicate dependencies of this cache. If the dependencies change, then the cache is recomputed. New in version 0.8.9, replaces cfgstr.

  • dpath (str | PathLike | None) – Specifies where to save the cache. If unspecified, Cacher defaults to an application resource dir as given by appname.

  • appname (str, default=’ubelt’) – Application name. Specifies a folder in the application resource directory in which to cache the data if dpath is not specified.

  • ext (str, default=’.pkl’) – File extension for the cache format

  • meta (object | None) – Metadata that is also saved with the cfgstr. This can be useful to indicate how the cfgstr was constructed.

  • verbose (int, default=1) – Level of verbosity. Can be 1, 2 or 3.

  • enabled (bool, default=True) – If set to False, then the load and save methods will do nothing.

  • log (Callable[[str], Any]) – Overloads the print function. Useful for sending output to loggers (e.g. logging.info, tqdm.tqdm.write, …)

  • hasher (str, default=’sha1’) – Type of hashing algorithm to use if cfgstr needs to be condensed to less than 49 characters.

  • protocol (int, default=-1) – Protocol version used by pickle. Defaults to -1, which selects the latest protocol. If Python 2 compatibility is required, set to 2.

  • cfgstr (str | None) – Deprecated in favor of depends. Indicates the state. Either this string or a hash of this string will be used to identify the cache. A cfgstr should always be reasonably readable, thus it is good practice to hash extremely detailed cfgstrs to a reasonably readable level. Use meta to make the original details persist.
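
To make the protocol parameter concrete, the following sketch uses only the standard pickle module (it does not call ubelt) to show what the values -1 and 2 mean. The variable names are illustrative.

```python
import pickle

data = {'weights': [1.0, 2.0, 3.0], 'name': 'demo'}

# A negative protocol number selects the highest protocol available
# in the running interpreter.
latest = pickle.dumps(data, protocol=-1)
# Protocol 2 is the newest protocol that Python 2 is able to read.
compat = pickle.dumps(data, protocol=2)

# For protocols >= 2 the stream starts with the PROTO opcode (0x80)
# followed by a single byte that holds the protocol number.
latest_version = latest[1]
compat_version = compat[1]

# Both streams decode to the same object.
assert pickle.loads(latest) == pickle.loads(compat) == data
```

A cache written with the latest protocol is typically more compact, but it can only be read by interpreters that support that protocol.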

Example

>>> import ubelt as ub
>>> depends = 'repr-of-params-that-uniquely-determine-the-process'
>>> # Create a cacher and try loading the data
>>> cacher = ub.Cacher('demo_process', depends, verbose=4)
>>> cacher.clear()
>>> data = cacher.tryload()
>>> if data is None:
>>>     # Put expensive functions in if block when cacher misses
>>>     myvar1 = 'result of expensive process'
>>>     myvar2 = 'another result'
>>>     # Tell the cacher to write at the end of the if block
>>>     # It is idiomatic to put results in an object named data
>>>     data = myvar1, myvar2
>>>     cacher.save(data)
>>> # Last part of the Cacher pattern is to unpack the data object
>>> myvar1, myvar2 = data
>>> #
>>> # If we know the data exists, we can also simply call load
>>> data = cacher.load()

Example

>>> # The previous example can be shortened if there is only a single value
>>> from ubelt.util_cache import Cacher
>>> depends = 'repr-of-params-that-uniquely-determine-the-process'
>>> # Create a cacher and try loading the data
>>> cacher = Cacher('demo_process', depends)
>>> myvar = cacher.tryload()
>>> if myvar is None:
>>>     myvar = ('result of expensive process', 'another result')
>>>     cacher.save(myvar)
>>> assert cacher.exists(), 'should now exist'

VERBOSE = 1
FORCE_DISABLE = False

get_fpath(cfgstr=None)

Reports the filepath that the cacher will use.

It will attempt to use ‘{fname}_{cfgstr}{ext}’ unless that is too long. Then cfgstr will be hashed.

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

Returns

str | PathLike

Example

>>> # xdoctest: +REQUIRES(module:pytest)
>>> from ubelt.util_cache import Cacher
>>> import pytest
>>> with pytest.warns(UserWarning):
>>>     cacher = Cacher('test_cacher1')
>>>     cacher.get_fpath()
>>> self = Cacher('test_cacher2', depends='cfg1')
>>> self.get_fpath()
>>> self = Cacher('test_cacher3', depends='cfg1' * 32)
>>> self.get_fpath()
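
The naming rule described for get_fpath can be sketched with the standard library. This is a minimal illustration, not ubelt's implementation: the helper name sketch_cache_fpath, the max_len argument, and the truncation of the digest are assumptions, with the 49-character threshold taken from the hasher parameter description.

```python
import hashlib
from os.path import join


def sketch_cache_fpath(dpath, fname, cfgstr, ext='.pkl', hasher='sha1',
                       max_len=49):
    """Hypothetical helper that builds '{fname}_{cfgstr}{ext}'."""
    if len(cfgstr) >= max_len:
        # The cfgstr is too long to be used directly, so condense it
        # to a hex digest that is shorter than max_len characters.
        digest = hashlib.new(hasher, cfgstr.encode('utf8')).hexdigest()
        cfgstr = digest[:max_len - 1]
    return join(dpath, '{}_{}{}'.format(fname, cfgstr, ext))


# A short cfgstr appears verbatim in the file name
short_fpath = sketch_cache_fpath('cache_dir', 'test_cacher2', 'cfg1')
# A long cfgstr is replaced by its hash
long_fpath = sketch_cache_fpath('cache_dir', 'test_cacher3', 'cfg1' * 32)
```

Short dependency strings stay readable in the file name, while long ones are replaced by a fixed-length digest.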

exists(cfgstr=None)

Check to see if the cache exists

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

Returns

bool

existing_versions()

Yields the paths of cache files that were previously computed with this cacher under different cfgstr values.

Yields

str – paths to cached files corresponding to this cacher

Example

>>> from ubelt.util_cache import Cacher
>>> # Ensure that some data exists
>>> known_fpaths = set()
>>> import ubelt as ub
>>> dpath = ub.ensure_app_cache_dir('ubelt',
>>>                                 'test-existing-versions')
>>> ub.delete(dpath)  # start fresh
>>> cacher = Cacher('versioned_data_v2', depends='1', dpath=dpath)
>>> cacher.ensure(lambda: 'data1')
>>> known_fpaths.add(cacher.get_fpath())
>>> cacher = Cacher('versioned_data_v2', depends='2', dpath=dpath)
>>> cacher.ensure(lambda: 'data2')
>>> known_fpaths.add(cacher.get_fpath())
>>> # List previously computed configs for this type
>>> from os.path import basename
>>> cacher = Cacher('versioned_data_v2', depends='2', dpath=dpath)
>>> exist_fpaths = set(cacher.existing_versions())
>>> exist_fnames = list(map(basename, exist_fpaths))
>>> print('exist_fnames = {!r}'.format(exist_fnames))
>>> assert exist_fpaths.issubset(known_fpaths)

clear(cfgstr=None)

Removes the saved cache and metadata from disk

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

tryload(cfgstr=None, on_error='raise')

Like load, but returns None if the load fails due to a cache miss.

Parameters
  • cfgstr (str | None) – overrides the instance-level cfgstr

  • on_error (str, default=’raise’) – How to handle non-IO errors. Either ‘raise’, which re-raises the exception, or ‘clear’, which deletes the cache and returns None.

Returns

the cached data if it exists, otherwise returns None

Return type

None | object
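
The difference between a plain miss and the two on_error modes can be sketched with the standard library. The function sketch_tryload below is a hypothetical stand-in that mirrors the behavior described above; it is not ubelt's code and it ignores cfgstr and metadata.

```python
import os
import pickle
import tempfile


def sketch_tryload(fpath, on_error='raise'):
    """Hypothetical stand-in for the tryload behavior described above."""
    if not os.path.exists(fpath):
        return None  # a plain cache miss
    try:
        with open(fpath, 'rb') as file:
            return pickle.load(file)
    except Exception:
        if on_error == 'raise':
            raise
        elif on_error == 'clear':
            # Delete the unreadable cache and report a miss
            os.remove(fpath)
            return None
        else:
            raise KeyError('unknown on_error={!r}'.format(on_error))


dpath = tempfile.mkdtemp()
fpath = os.path.join(dpath, 'demo.pkl')

miss = sketch_tryload(fpath)  # nothing has been saved yet

with open(fpath, 'wb') as file:
    pickle.dump({'answer': 42}, file)
hit = sketch_tryload(fpath)

with open(fpath, 'wb') as file:
    file.write(b'\x00corrupt')  # simulate a corrupted cache file
cleared = sketch_tryload(fpath, on_error='clear')
file_removed = not os.path.exists(fpath)
```

With on_error='clear', a corrupted cache behaves like a miss, so the usual "if data is None" branch recomputes and overwrites it.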

load(cfgstr=None)

Load the cached data and raise an error if something goes wrong.

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

Returns

the cached data

Return type

object

Raises

IOError – if the data is unable to be loaded. This could be due to a cache miss or because the cache is disabled.

Example

>>> from ubelt.util_cache import *  # NOQA
>>> # Setting the cacher as enabled=False turns it off
>>> cacher = Cacher('test_disabled_load', '', enabled=True)
>>> cacher.save('data')
>>> assert cacher.load() == 'data'
>>> cacher.enabled = False
>>> assert cacher.tryload() is None

save(data, cfgstr=None)

Writes data to the path specified by self.get_fpath(cfgstr).

Metadata containing information about the cache will also be appended to an adjacent file with the .meta suffix.

Parameters
  • data (object) – arbitrary pickleable object to be cached

  • cfgstr (str | None) – overrides the instance-level cfgstr

Example

>>> from ubelt.util_cache import *  # NOQA
>>> from os.path import exists
>>> # Normal functioning
>>> depends = 'long-cfg' * 32
>>> cacher = Cacher('test_enabled_save', depends=depends)
>>> cacher.save('data')
>>> assert exists(cacher.get_fpath()), 'should be enabled'
>>> assert exists(cacher.get_fpath() + '.meta'), 'missing metadata'
>>> # Setting the cacher as enabled=False turns it off
>>> cacher2 = Cacher('test_disabled_save', 'params', enabled=False)
>>> cacher2.save('data')
>>> assert not exists(cacher2.get_fpath()), 'should be disabled'

ensure(func, *args, **kwargs)

Wraps around a function. A cfgstr must be stored in the base cacher.

Parameters
  • func (Callable) – function that will compute data on cache miss

  • *args – passed to func

  • **kwargs – passed to func

Example

>>> from ubelt.util_cache import *  # NOQA
>>> def func():
>>>     return 'expensive result'
>>> fname = 'test_cacher_ensure'
>>> depends = 'func params'
>>> cacher = Cacher(fname, depends=depends)
>>> cacher.clear()
>>> data1 = cacher.ensure(func)
>>> data2 = cacher.ensure(func)
>>> assert data1 == 'expensive result'
>>> assert data1 == data2
>>> cacher.clear()
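
The load-or-compute pattern that ensure wraps can be sketched with the standard library. The helper sketch_ensure and the call counter are hypothetical and only illustrate that the wrapped function runs once; ubelt's own method also handles cfgstr, metadata, and the enabled flag.

```python
import os
import pickle
import tempfile


def sketch_ensure(fpath, func, *args, **kwargs):
    """Hypothetical helper: load from fpath, or compute and save."""
    if os.path.exists(fpath):
        with open(fpath, 'rb') as file:
            return pickle.load(file)
    # Cache miss: run the function and store its result
    data = func(*args, **kwargs)
    with open(fpath, 'wb') as file:
        pickle.dump(data, file)
    return data


calls = []


def expensive(x):
    calls.append(x)  # record each real execution
    return x ** 2


fpath = os.path.join(tempfile.mkdtemp(), 'ensure_demo.pkl')
data1 = sketch_ensure(fpath, expensive, 12)  # computes and saves
data2 = sketch_ensure(fpath, expensive, 12)  # served from disk
```

Note that in this sketch the arguments are not part of the cache key, which is why the real API asks for the dependencies to be declared on the cacher itself.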

class ubelt.util_cache.CacheStamp(fname, dpath, cfgstr=None, product=None, hasher='sha1', verbose=None, enabled=True, depends=None, meta=None)

Bases: object

Quickly determine if a file-producing computation has been done.

Writes a file that marks that a procedure has been done by writing a “stamp” file to disk. Removing the stamp file will force recomputation. However, removing or changing the result of the computation may not trigger recomputation unless specific handling is done or the expected “product” of the computation is a file and registered with the stamper. If hasher is None, we only check if the product exists, and we ignore its hash, otherwise it checks that the hash of the product is the same.

Parameters
  • fname (str) – Name of the stamp file

  • dpath (str | PathLike | None) – Where to store the cached stamp file

  • product (str | PathLike | Sequence[str | PathLike] | None) – Path or paths that we expect the computation to produce. If specified, the hashes of the paths are stored.

  • hasher (str, default=’sha1’) – The type of hasher used to compute the file hash of product. If None, then we assume the file has not been corrupted or changed. Defaults to sha1.

  • verbose (bool, default=None) – Passed to internal ub.Cacher object

  • enabled (bool, default=True) – if False, expired always returns True

  • depends (str | List[str] | None) – Indicate dependencies of this cache. If the dependencies change, then the cache is recomputed. New to CacheStamp in version 0.9.2, replaces cfgstr.

  • meta (object | None) – Metadata that is also saved with the cfgstr. This can be useful to indicate how the cfgstr was constructed. New to CacheStamp in version 0.9.2.

  • cfgstr (str | None) – Deprecated in favor of depends. Configuration associated with the stamped computation. A common pattern is to call ubelt.hash_data() on a dependency list. Either this string or a hash of this string will be used to identify the cache. A cfgstr should always be reasonably readable, thus it is good practice to hash extremely detailed cfgstrs to a reasonably readable level. Use meta to make the original details persist.

Todo

  • [ ] expiration time delta or date time (also remember when renewed)

Example

>>> import ubelt as ub
>>> from os.path import join
>>> # Stamp the computation of expensive-to-compute.txt
>>> dpath = ub.ensure_app_cache_dir('ubelt', 'test-cache-stamp')
>>> ub.delete(dpath)
>>> ub.ensuredir(dpath)
>>> product = join(dpath, 'expensive-to-compute.txt')
>>> self = ub.CacheStamp('somedata', depends='someconfig', dpath=dpath,
>>>                      product=product, hasher=None)
>>> self.hasher = None
>>> if self.expired():
>>>     ub.writeto(product, 'very expensive')
>>>     self.renew()
>>> assert not self.expired()
>>> # corrupting the output will not expire in non-robust mode
>>> ub.writeto(product, 'corrupted')
>>> assert not self.expired()
>>> self.hasher = 'sha1'
>>> # but it will expire if we are in robust mode
>>> assert self.expired()
>>> # deleting the product will cause expiration in any mode
>>> self.hasher = None
>>> ub.delete(product)
>>> assert self.expired()

expired(cfgstr=None, product=None)

Check to see if a previously existing stamp is still valid and if the expected result of that computation still exists.

Parameters
  • cfgstr (str | None) – overrides the instance-level cfgstr

  • product (str | PathLike | Sequence[str | PathLike] | None) – override the default product if specified

Returns

True if the stamp is invalid or does not exist.

Return type

bool

renew(cfgstr=None, product=None)

Recertify that the product has been recomputed by writing a new certificate to disk.

Returns

certificate information

Return type

dict
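
The stamp technique itself can be sketched with the standard library. The functions sketch_renew and sketch_expired, the JSON certificate layout, and the stamp file name are all assumptions made for illustration; they show the idea of recording the dependencies and a product hash, not ubelt's actual on-disk format.

```python
import hashlib
import json
import os
import tempfile


def _file_hash(fpath, hasher='sha1'):
    with open(fpath, 'rb') as file:
        return hashlib.new(hasher, file.read()).hexdigest()


def sketch_renew(stamp_fpath, product, depends, hasher='sha1'):
    """Write a certificate recording the dependencies and product hash."""
    certificate = {
        'depends': depends,
        'product': product,
        'product_hash': _file_hash(product, hasher) if hasher else None,
    }
    with open(stamp_fpath, 'w') as file:
        json.dump(certificate, file)
    return certificate


def sketch_expired(stamp_fpath, product, depends, hasher='sha1'):
    """Return True if the stamp is missing, stale, or the product changed."""
    if not os.path.exists(stamp_fpath) or not os.path.exists(product):
        return True
    with open(stamp_fpath, 'r') as file:
        certificate = json.load(file)
    if certificate['depends'] != depends:
        return True
    if hasher is not None:
        # Robust mode: the product content must match the recorded hash
        return certificate['product_hash'] != _file_hash(product, hasher)
    return False


dpath = tempfile.mkdtemp()
product = os.path.join(dpath, 'expensive-to-compute.txt')
stamp_fpath = os.path.join(dpath, 'somedata.stamp.json')
depends = 'someconfig'

expired_before = sketch_expired(stamp_fpath, product, depends)
with open(product, 'w') as file:
    file.write('very expensive')
sketch_renew(stamp_fpath, product, depends)
expired_after_renew = sketch_expired(stamp_fpath, product, depends)

# Corrupting the product is only detected when a hasher is used
with open(product, 'w') as file:
    file.write('corrupted')
expired_without_hash = sketch_expired(stamp_fpath, product, depends,
                                      hasher=None)
expired_with_hash = sketch_expired(stamp_fpath, product, depends,
                                   hasher='sha1')
```

Skipping the hash makes the check cheaper for large products, at the cost of only noticing when the product is missing rather than when it has changed.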