ubelt.util_cache module

This module exposes Cacher and CacheStamp classes, which provide a simple API for on-disk caching.

The Cacher class is the simplest and most direct method of caching. In fact, it only requires four lines of boilerplate, which is the smallest general and robust way that I (Jon Crall) have achieved, and I don’t think it’s possible to do better. These four lines implement the following necessary and sufficient steps for general robust on-disk caching.

  1. Defining the cache dependencies

  2. Checking if the cache missed

  3. Loading the cache on a hit

  4. Executing the process and saving the result on a miss.

The following example illustrates these four points.

Example

>>> import ubelt as ub
>>> # Define a cache name and dependencies (which are fed to `ub.hash_data`)
>>> cacher = ub.Cacher('name', depends='set-of-deps')  # boilerplate:1
>>> # Calling tryload will return your data on a hit and None on a miss
>>> data = cacher.tryload(on_error='clear')            # boilerplate:2
>>> # Check if you need to recompute your data
>>> if data is None:                                   # boilerplate:3
>>>     # Your code to recompute data goes here (this is not boilerplate).
>>>     data = 'mydata'
>>>     # Cache the computation result (via pickle)
>>>     cacher.save(data)                              # boilerplate:4

Surprisingly, this uses just as many boilerplate lines as a decorator-style cacher, but it is much more extensible. It is possible to use Cacher in more sophisticated ways (e.g. metadata), but the simple in-line use is often easier and cleaner. The following example compares the two styles:

Example

>>> import ubelt as ub
>>> @ub.Cacher('name', depends={'dep1': 1, 'dep2': 2})  # boilerplate:1
>>> def func():                                         # boilerplate:2
>>>     data = 'mydata'
>>>     return data                                     # boilerplate:3
>>> data = func()                                       # boilerplate:4

>>> # The equivalent in-line version
>>> cacher = ub.Cacher('name', depends=['dependencies'])  # boilerplate:1
>>> data = cacher.tryload(on_error='clear')               # boilerplate:2
>>> if data is None:                                      # boilerplate:3
>>>     data = 'mydata'
>>>     cacher.save(data)                                 # boilerplate:4

While the above two are equivalent, the second version provides a simpler traceback, explicit procedures, and makes it easier to use breakpoint debugging (because there is no closure scope).

While Cacher is used to store direct results of in-line code in a pickle format, the CacheStamp object is used to cache processes that produce on-disk side effects other than the main return value. For instance, consider the following example:

Example

>>> import ubelt as ub
>>> def compute_many_files(dpath):
...     for i in range(10):
...         fpath = '{}/file{}.txt'.format(dpath, i)
...         with open(fpath, 'w') as file:
...             file.write('foo' + str(i))
>>> dpath = ub.Path.appdir('ubelt/demo/cache').delete().ensuredir()
>>> # You must specify a directory, unlike in Cacher where it is optional
>>> self = ub.CacheStamp('name', dpath=dpath, depends={'a': 1, 'b': 2})
>>> if self.expired():
>>>     compute_many_files(dpath)
>>>     # Instead of caching the whole process, we just write a file
>>>     # that signals the process has been done.
>>>     self.renew()
>>> assert not self.expired()

The CacheStamp is lightweight in that it simply marks that a process has been completed, but the job of saving / loading the actual data is left to the developer. The expired method checks if the stamp exists, and renew writes the stamp to disk.
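To make the stamp idea concrete, here is a minimal standard-library sketch of the same pattern. It is not ubelt's implementation: the `MiniStamp` class, its JSON marker file, and its file naming scheme are assumptions made purely for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class MiniStamp:
    """Hypothetical stand-in for the stamp pattern (not ubelt's code)."""

    def __init__(self, name, dpath, depends):
        # Hash the dependencies so a change in them selects a new marker file.
        dep_text = json.dumps(depends, sort_keys=True)
        dep_hash = hashlib.sha1(dep_text.encode('utf8')).hexdigest()[:8]
        self.fpath = Path(dpath) / '{}_{}.stamp.json'.format(name, dep_hash)

    def expired(self):
        # The stamp counts as expired whenever its marker file is missing.
        return not self.fpath.exists()

    def renew(self):
        # Writing the marker records that the process has completed.
        self.fpath.write_text(json.dumps({'done': True}))


dpath = Path(tempfile.mkdtemp())
stamp = MiniStamp('name', dpath, depends={'a': 1, 'b': 2})
first_check = stamp.expired()      # no marker yet, so the work must be done
if first_check:
    (dpath / 'output.txt').write_text('side effect')
    stamp.renew()
second_check = stamp.expired()     # the marker exists now
# Changing a dependency points at a different marker, which does not exist.
changed = MiniStamp('name', dpath, depends={'a': 1, 'b': 3}).expired()
```

The sketch only tracks whether the marker exists. It knows nothing about the files the process wrote, which is the gap that the product argument of CacheStamp is meant to close.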

In ubelt version 1.1.0, several additional features were added to CacheStamp. In addition to specifying parameters via depends, it is also possible for CacheStamp to determine if an associated file has been modified. To do this, the paths of the files must be known a priori and passed to CacheStamp via the product argument. This allows the CacheStamp to detect if the files have been modified since the renew method was called. It does this by remembering the size, modified time, and checksum of each file. If the expected hash of each product is known in advance, it is also possible to specify it via the hash_prefix argument. In this case, renew will raise an Exception if the specified hash prefix does not match the files on disk. Lastly, it is possible to specify an expiration time via expires, after which the CacheStamp will always be marked as invalid. This is now the mechanism via which the cache in ubelt.util_download.grabdata() works.

Example

>>> import ubelt as ub
>>> dpath = ub.Path.appdir('ubelt/demo/cache').delete().ensuredir()
>>> params = {'a': 1, 'b': 2}
>>> expected_fpaths = [dpath / 'file{}.txt'.format(i) for i in range(2)]
>>> hash_prefix = ['a7a8a91659601590e17191301dc1',
...                '55ae75d991c770d8f3ef07cbfde1']
>>> self = ub.CacheStamp('name', dpath=dpath, depends=params,
>>>                      hash_prefix=hash_prefix, hasher='sha256',
>>>                      product=expected_fpaths, expires='2101-01-01T000000Z')
>>> if self.expired():
>>>     for fpath in expected_fpaths:
...         fpath.write_text(fpath.name)
>>>     self.renew()
>>> # modifying or removing the file will cause the stamp to expire
>>> expected_fpaths[0].write_text('corrupted')
>>> assert self.expired()
class ubelt.util_cache.Cacher(fname, depends=None, dpath=None, appname='ubelt', ext='.pkl', meta=None, verbose=None, enabled=True, log=None, hasher='sha1', protocol=-1, cfgstr=None, backend='auto')

Bases: object

Saves data to disk and reloads it based on specified dependencies.

Cacher uses pickle to save/load data to/from disk. Dependencies of the cached process can be specified, which ensures the cached data is recomputed if the dependencies change. If the location of the cache is not specified, it will default to the system user’s cache directory.

Parameters
  • fname (str) – A file name. This is the prefix that will be used by the cache. It will always be used as-is.

  • depends (str | List[str] | None) – Indicate dependencies of this cache. If the dependencies change, then the cache is recomputed. New in version 0.8.9, replaces cfgstr.

  • dpath (str | PathLike | None) – Specifies where to save the cache. If unspecified, Cacher defaults to an application cache dir as given by appname. See ub.get_app_cache_dir() for more details.

  • appname (str, default='ubelt') – Application name. Specifies the folder in the application cache directory where the data is cached if dpath is not specified.

  • ext (str, default=’.pkl’) – File extension for the cache format. Can be '.pkl' or '.json'.

  • meta (object | None) – Metadata that is also saved with the cfgstr. This can be useful to indicate how the cfgstr was constructed. Note: this is a candidate for deprecation.

  • verbose (int, default=1) – Level of verbosity. Can be 1, 2 or 3.

  • enabled (bool, default=True) – If set to False, then the load and save methods will do nothing.

  • log (Callable[[str], Any]) – Overloads the print function. Useful for sending output to loggers (e.g. logging.info, tqdm.tqdm.write, …)

  • hasher (str, default=’sha1’) – Type of hashing algorithm to use if cfgstr needs to be condensed to less than 49 characters.

  • protocol (int, default=-1) – Protocol version used by pickle. Defaults to -1, which selects the latest protocol.

  • backend (str) – Set to either 'pickle' or 'json' to force backend. Defaults to auto which chooses one based on the extension.

  • cfgstr (str | None) – Deprecated in favor of depends.

Example

>>> import ubelt as ub
>>> depends = 'repr-of-params-that-uniquely-determine-the-process'
>>> # Create a cacher and try loading the data
>>> cacher = ub.Cacher('demo_process', depends, verbose=4)
>>> cacher.clear()
>>> print(f'cacher.fpath={cacher.fpath}')
>>> data = cacher.tryload()
>>> if data is None:
>>>     # Put expensive functions in if block when cacher misses
>>>     myvar1 = 'result of expensive process'
>>>     myvar2 = 'another result'
>>>     # Tell the cacher to write at the end of the if block
>>>     # It is idiomatic to put results in an object named data
>>>     data = myvar1, myvar2
>>>     cacher.save(data)
>>> # Last part of the Cacher pattern is to unpack the data object
>>> myvar1, myvar2 = data
>>> #
>>> # If we know the data exists, we can also simply call load
>>> data = cacher.load()

Example

>>> # The previous example can be shortened if only a single value is cached
>>> from ubelt.util_cache import Cacher
>>> depends = 'repr-of-params-that-uniquely-determine-the-process'
>>> # Create a cacher and try loading the data
>>> cacher = Cacher('demo_process', depends)
>>> myvar = cacher.tryload()
>>> if myvar is None:
>>>     myvar = ('result of expensive process', 'another result')
>>>     cacher.save(myvar)
>>> assert cacher.exists(), 'should now exist'
VERBOSE = 1
FORCE_DISABLE = False
property fpath
get_fpath(cfgstr=None)

Reports the filepath that the cacher will use.

It will attempt to use '{fname}_{cfgstr}{ext}' unless that is too long, in which case cfgstr is hashed.
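The naming rule for get_fpath can be sketched with the standard library as follows. This is an illustration, not ubelt's code: the `sketch_fpath` helper is hypothetical, and the 49-character threshold is borrowed from the description of the hasher parameter. The exact threshold and truncation used by ubelt may differ.

```python
import hashlib
import os


def sketch_fpath(dpath, fname, cfgstr, ext='.pkl', max_len=49, hasher='sha1'):
    """Hypothetical helper mirroring the documented naming rule."""
    if len(cfgstr) > max_len:
        # Too long for a readable file name: use a (truncated) hash instead.
        digest = hashlib.new(hasher, cfgstr.encode('utf8')).hexdigest()
        cfgstr = digest[:max_len]
    return os.path.join(dpath, '{}_{}{}'.format(fname, cfgstr, ext))


# A short cfgstr is used verbatim in the file name.
short_path = sketch_fpath('/tmp/cache', 'test_cacher2', 'cfg1')
# A long cfgstr (128 characters here) is replaced by its hash.
long_path = sketch_fpath('/tmp/cache', 'test_cacher3', 'cfg1' * 32)
```

Hashing keeps file names bounded in length while still mapping each distinct set of dependencies to its own cache file.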

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

Returns

str | PathLike

Example

>>> from ubelt.util_cache import Cacher
>>> # A cacher without dependencies no longer emits a warning
>>> cacher = Cacher('test_cacher1')
>>> cacher.get_fpath()
>>> self = Cacher('test_cacher2', depends='cfg1')
>>> self.get_fpath()
>>> self = Cacher('test_cacher3', depends='cfg1' * 32)
>>> self.get_fpath()
exists(cfgstr=None)

Check to see if the cache exists

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

Returns

bool

existing_versions()

Yields the paths of cache files that were previously computed by this cacher with different cfgstr values.

Yields

str – paths to cached files corresponding to this cacher

Example

>>> from ubelt.util_cache import Cacher
>>> # Ensure that some data exists
>>> known_fpaths = set()
>>> import ubelt as ub
>>> dpath = ub.ensure_app_cache_dir('ubelt',
>>>                                 'test-existing-versions')
>>> ub.delete(dpath)  # start fresh
>>> cacher = Cacher('versioned_data_v2', depends='1', dpath=dpath)
>>> cacher.ensure(lambda: 'data1')
>>> known_fpaths.add(cacher.get_fpath())
>>> cacher = Cacher('versioned_data_v2', depends='2', dpath=dpath)
>>> cacher.ensure(lambda: 'data2')
>>> known_fpaths.add(cacher.get_fpath())
>>> # List previously computed configs for this type
>>> from os.path import basename
>>> cacher = Cacher('versioned_data_v2', depends='2', dpath=dpath)
>>> exist_fpaths = set(cacher.existing_versions())
>>> exist_fnames = list(map(basename, exist_fpaths))
>>> print('exist_fnames = {!r}'.format(exist_fnames))
>>> assert exist_fpaths.issubset(known_fpaths)
clear(cfgstr=None)

Removes the saved cache and metadata from disk

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

tryload(cfgstr=None, on_error='raise')

Like load, but returns None if the load fails due to a cache miss.

Parameters
  • cfgstr (str | None) – overrides the instance-level cfgstr

  • on_error (str, default='raise') – How to handle non-IO errors. Either 'raise', which re-raises the exception, or 'clear', which deletes the cache and returns None.

Returns

the cached data if it exists, otherwise returns None

Return type

None | object
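The difference between a cache miss and a non-IO error can be illustrated with a small pickle-based sketch. The `sketch_tryload` function below is hypothetical and only mimics the documented behavior of the on_error argument. It is not how ubelt implements tryload.

```python
import os
import pickle
import tempfile


def sketch_tryload(fpath, on_error='raise'):
    """Hypothetical helper: return cached data, or None on a cache miss."""
    try:
        with open(fpath, 'rb') as file:
            return pickle.load(file)
    except IOError:
        # Cache miss: the file does not exist or cannot be read.
        return None
    except Exception:
        # Non-IO failure, for example a truncated or corrupted pickle.
        if on_error == 'clear':
            os.remove(fpath)
            return None
        raise


dpath = tempfile.mkdtemp()
fpath = os.path.join(dpath, 'demo.pkl')
miss = sketch_tryload(fpath)                       # nothing saved yet

with open(fpath, 'wb') as file:
    pickle.dump({'answer': 42}, file)
hit = sketch_tryload(fpath)                        # a normal cache hit

with open(fpath, 'wb') as file:
    file.write(b'not a pickle')                    # simulate corruption
cleared = sketch_tryload(fpath, on_error='clear')  # corrupt file is removed
still_there = os.path.exists(fpath)
```

With on_error='clear' a corrupted cache behaves like a miss, so the usual "if data is None" branch recomputes and overwrites it.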

load(cfgstr=None)

Load the data cached and raise an error if something goes wrong.

Parameters

cfgstr (str | None) – overrides the instance-level cfgstr

Returns

the cached data

Return type

object

Raises

IOError – if the data cannot be loaded. This could be due to a cache miss or because the cache is disabled.

Example

>>> from ubelt.util_cache import *  # NOQA
>>> # An enabled cacher saves and loads normally
>>> cacher = Cacher('test_disabled_load', '', enabled=True)
>>> cacher.save('data')
>>> assert cacher.load() == 'data'
>>> # Setting enabled=False turns the cacher off
>>> cacher.enabled = False
>>> assert cacher.tryload() is None
save(data, cfgstr=None)

Writes data to path specified by self.fpath.

Metadata containing information about the cache will also be appended to an adjacent file with the .meta suffix.

Parameters
  • data (object) – arbitrary pickleable object to be cached

  • cfgstr (str | None) – overrides the instance-level cfgstr

Example

>>> from os.path import exists
>>> from ubelt.util_cache import Cacher
>>> # Normal functioning
>>> depends = 'long-cfg' * 32
>>> cacher = Cacher('test_enabled_save', depends=depends)
>>> cacher.save('data')
>>> assert exists(cacher.get_fpath()), 'should be enabled'
>>> assert exists(cacher.get_fpath() + '.meta'), 'missing metadata'
>>> # Setting the cacher as enabled=False turns it off
>>> cacher2 = Cacher('test_disabled_save', 'params', enabled=False)
>>> cacher2.save('data')
>>> assert not exists(cacher2.get_fpath()), 'should be disabled'
ensure(func, *args, **kwargs)

Wraps around a function. A cfgstr must be stored in the base cacher.

Parameters
  • func (Callable) – function that will compute data on cache miss

  • *args – passed to func

  • **kwargs – passed to func
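Conceptually, ensure folds the "try to load, otherwise compute and save" pattern into one call. The sketch below shows that control flow with an in-memory dict standing in for the on-disk cache. The `sketch_ensure` helper and the dict-based store are assumptions for illustration, not ubelt internals.

```python
def sketch_ensure(store, key, func, *args, **kwargs):
    """Hypothetical helper: call func only when key is not cached yet."""
    data = store.get(key)          # plays the role of tryload
    if data is None:
        data = func(*args, **kwargs)
        store[key] = data          # plays the role of save
    return data


calls = []


def expensive(x):
    # Record each invocation so we can see how often the work really runs.
    calls.append(x)
    return x * 2


store = {}
data1 = sketch_ensure(store, 'demo', expensive, 21)   # computes and stores
data2 = sketch_ensure(store, 'demo', expensive, 21)   # served from the store
```

Note that the arguments passed to func are not part of the cache key in this sketch, which is why the dependencies have to be declared on the cacher itself.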

Example

>>> from ubelt.util_cache import *  # NOQA
>>> def func():
>>>     return 'expensive result'
>>> fname = 'test_cacher_ensure'
>>> depends = 'func params'
>>> cacher = Cacher(fname, depends=depends)
>>> cacher.clear()
>>> data1 = cacher.ensure(func)
>>> data2 = cacher.ensure(func)
>>> assert data1 == 'expensive result'
>>> assert data1 == data2
>>> cacher.clear()
class ubelt.util_cache.CacheStamp(fname, dpath, cfgstr=None, product=None, hasher='sha1', verbose=None, enabled=True, depends=None, meta=None, hash_prefix=None, expires=None, ext='.pkl')

Bases: object

Quickly determine if a file-producing computation has been done.

Check if the computation needs to be redone by calling expired. If the stamp is not expired, the user can expect that the results exist and could be loaded. If the stamp is expired, the computation should be redone. After the result is updated, the user calls renew, which writes a “stamp” file to disk that marks that the procedure has been done.

There are several ways to control how a stamp expires. At a bare minimum, removing the stamp file will force expiration. However, in this circumstance CacheStamp only knows that something has been done, but it doesn’t have any information about what was done, so in general this is not sufficient.

To achieve more robust expiration behavior, the user should specify the product argument, which is a list of file paths that are expected to exist whenever the stamp is renewed. When this is specified the CacheStamp will expire if any of these products are deleted, their size changes, their modified timestamp changes, or their hash (i.e. checksum) changes. Note that by setting hasher=None, running and verifying checksums can be disabled.
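The size, modified-time, and checksum comparison can be sketched with the standard library. The `fingerprint` and `why_expired` helpers are hypothetical. The reason strings 'size_diff', 'mtime_diff', and 'hash_diff' are borrowed from the expired example in this document, while 'missing' and the order of the checks are assumptions.

```python
import hashlib
import os
import tempfile


def fingerprint(fpath, hasher='sha256'):
    """Record the facts about a product that a stamp could remember."""
    stat = os.stat(fpath)
    info = {'size': stat.st_size, 'mtime_ns': stat.st_mtime_ns}
    if hasher is not None:
        with open(fpath, 'rb') as file:
            info['hash'] = hashlib.new(hasher, file.read()).hexdigest()
    return info


def why_expired(fpath, recorded, hasher='sha256'):
    """Return a reason string if the product changed, otherwise False."""
    if not os.path.exists(fpath):
        return 'missing'
    current = fingerprint(fpath, hasher)
    # Compare the cheap metadata first and the checksum last.
    if current['size'] != recorded['size']:
        return 'size_diff'
    if current['mtime_ns'] != recorded['mtime_ns']:
        return 'mtime_diff'
    if hasher is not None and current['hash'] != recorded['hash']:
        return 'hash_diff'
    return False


dpath = tempfile.mkdtemp()
fpath = os.path.join(dpath, 'product.txt')
with open(fpath, 'w') as file:
    file.write('corrupted')
recorded = fingerprint(fpath)
unchanged = why_expired(fpath, recorded)

# Rewrite with the same size and restore the timestamps afterwards,
# so that only the checksum can reveal the change.
stat = os.stat(fpath)
with open(fpath, 'w') as file:
    file.write('corrApted')
os.utime(fpath, ns=(stat.st_atime_ns, stat.st_mtime_ns))
reason = why_expired(fpath, recorded)
# With hashing disabled the same modification goes unnoticed.
blind = why_expired(fpath, recorded, hasher=None)
```

The last line shows the trade-off behind hasher=None: skipping the checksum is faster, but a change that preserves size and modified time can no longer be detected.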

If the user knows what the hash of the file should be, this can be specified via hash_prefix to prevent renewal of the stamp unless it matches the files on disk. This can be useful for security purposes.
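A prefix check of this kind can be written in a few lines. The `check_hash_prefix` helper below is a hypothetical illustration that operates on bytes in memory. It raises RuntimeError on a mismatch, the exception type that the expired example in this document expects from renew.

```python
import hashlib


def check_hash_prefix(data, hash_prefix, hasher='sha256'):
    """Raise if the hex digest of data does not start with hash_prefix."""
    digest = hashlib.new(hasher, data).hexdigest()
    if not digest.startswith(hash_prefix):
        raise RuntimeError(
            'hash mismatch: expected prefix {!r}, got {!r}'.format(
                hash_prefix, digest[:len(hash_prefix)]))
    return digest


payload = b'very expensive'
full_digest = check_hash_prefix(payload, '')       # empty prefix always matches
ok_digest = check_hash_prefix(payload, full_digest[:12])

try:
    # Different content must not pass a check against the known prefix.
    check_hash_prefix(b'very corrupted', full_digest[:12])
    refused = False
except RuntimeError:
    refused = True
```

Comparing only a prefix keeps the expected value short enough to paste into source code. A longer prefix makes an accidental or malicious match less likely.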

The stamp can also be set to expire at a specified time or after a specified duration using the expires argument.

Parameters
  • fname (str) – Name of the stamp file

  • dpath (str | PathLike | None) – Where to store the cached stamp file

  • product (str | PathLike | Sequence[str | PathLike] | None) – Path or paths that we expect the computation to produce. If specified, the hashes of the paths are stored.

  • hasher (str, default=’sha1’) – The type of hasher used to compute the file hash of product. If None, then we assume the file has not been corrupted or changed if the mtime and size are the same. Defaults to sha1.

  • verbose (bool, default=None) – Passed to internal ubelt.Cacher object

  • enabled (bool, default=True) – if False, expired always returns True

  • depends (str | List[str] | None) – Indicate dependencies of this cache. If the dependencies change, then the cache is recomputed. New to CacheStamp in version 0.9.2, replaces cfgstr.

  • meta (object | None) – Metadata that is also saved with the cfgstr. This can be useful to indicate how the cfgstr was constructed. New to CacheStamp in version 0.9.2. Note: this is a candidate for deprecation.

  • expires (str | int | datetime.datetime | datetime.timedelta | None) – If specified, sets an expiration date for the certificate. This can be an absolute datetime or a timedelta offset. If specified as an int, this is interpreted as a time delta in seconds. If specified as a str, this is interpreted as an absolute timestamp. Time delta offsets are coerced to absolute times at “renew” time.

  • hash_prefix (None | str | List[str]) – If specified, we verify that these match the hash(s) of the product(s) in the stamp certificate.

  • ext (str, default=’.pkl’) – File extension for the cache format. Can be '.pkl' or '.json'.

  • cfgstr (str | None) – DEPRECATED in favor of depends.
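The coercion rules listed for the expires parameter can be sketched as below. The `coerce_expires` helper is hypothetical, and the string branch assumes an ISO 8601 timestamp that `datetime.datetime.fromisoformat` accepts, which is narrower than whatever formats ubelt itself may parse.

```python
import datetime


def coerce_expires(expires, now):
    """Hypothetical helper: turn an expires value into an absolute datetime."""
    if expires is None:
        return None
    if isinstance(expires, datetime.datetime):
        return expires
    if isinstance(expires, datetime.timedelta):
        # Offsets become absolute relative to the time of renewal.
        return now + expires
    if isinstance(expires, int):
        # Integers are treated as an offset in seconds.
        return now + datetime.timedelta(seconds=expires)
    if isinstance(expires, str):
        # Assumption: an ISO 8601 string understood by fromisoformat.
        return datetime.datetime.fromisoformat(expires)
    raise TypeError('unsupported expires value: {!r}'.format(expires))


utc = datetime.timezone.utc
now = datetime.datetime(2024, 1, 1, tzinfo=utc)   # stand-in for "renew" time
from_int = coerce_expires(60, now)
from_delta = coerce_expires(datetime.timedelta(hours=1), now)
from_str = coerce_expires('2101-01-01T00:00:00+00:00', now)
# An offset of zero yields a time that has already been reached.
already_expired = coerce_expires(0, now) <= now
```

Resolving offsets at renewal time means a stamp created with a duration expires that long after the last renewal, not that long after the object was constructed.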

Notes

The size, mtime, and hash mechanism is similar to how Makefile and redo caches work.

Example

>>> import ubelt as ub
>>> # Stamp the computation of expensive-to-compute.txt
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp')
>>> dpath.delete().ensuredir()
>>> product = dpath / 'expensive-to-compute.txt'
>>> self = ub.CacheStamp('somedata', depends='someconfig', dpath=dpath,
>>>                      product=product, hasher='sha256')
>>> self.clear()
>>> print(f'self.fpath={self.fpath}')
>>> if self.expired():
>>>     product.write_text('very expensive')
>>>     self.renew()
>>> assert not self.expired()
>>> # corrupting the output will cause the stamp to expire
>>> product.write_text('very corrupted')
>>> assert self.expired()
property fpath
clear()

Delete the stamp (the products are untouched)

expired(cfgstr=None, product=None)

Check to see if a previously existing stamp is still valid, if the expected result of that computation still exists, and if all other expiration criteria are met.

Parameters
  • cfgstr (str | None) – overrides the instance-level cfgstr. DEPRECATED do not use.

  • product (str | PathLike | Sequence[str | PathLike] | None) – override the default product if specified. DEPRECATED do not use.

Returns

A truthy value if the stamp is invalid, expired, or does not exist; in that case the reason for expiration is returned as a string. If the stamp is still valid, False is returned.

Return type

bool | str

Example

>>> import ubelt as ub
>>> import os
>>> # Stamp the computation of two product files
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp-expired')
>>> dpath.delete().ensuredir()
>>> products = [
>>>     dpath / 'product1.txt',
>>>     dpath / 'product2.txt',
>>> ]
>>> self = ub.CacheStamp('myname', depends='myconfig', dpath=dpath,
>>>                      product=products, hasher='sha256',
>>>                      expires=0)
>>> if self.expired():
>>>     for fpath in products:
>>>         fpath.write_text(fpath.name)
>>>     self.renew()
>>> fpath = products[0]
>>> # Because we set the expiration delta to 0, we should already be expired
>>> assert self.expired() == 'expired_cert'
>>> # Disable the expiration date, renew and we should be ok
>>> self.expires = None
>>> self.renew()
>>> assert not self.expired()
>>> # Modify the mtime to cause expiration
>>> orig_atime = fpath.stat().st_atime
>>> orig_mtime = fpath.stat().st_mtime
>>> os.utime(fpath, (orig_atime, orig_mtime + 200))
>>> assert self.expired() == 'mtime_diff'
>>> self.renew()
>>> assert not self.expired()
>>> # rewriting the file will cause the size constraint to fail
>>> # even if we hack the mtime to be the same
>>> orig_atime = fpath.stat().st_atime
>>> orig_mtime = fpath.stat().st_mtime
>>> fpath.write_text('corrupted')
>>> os.utime(fpath, (orig_atime, orig_mtime))
>>> assert self.expired() == 'size_diff'
>>> self.renew()
>>> assert not self.expired()
>>> # Force a situation where the hash is the only thing
>>> # that saves us, write a different file with the same
>>> # size and mtime.
>>> orig_atime = fpath.stat().st_atime
>>> orig_mtime = fpath.stat().st_mtime
>>> fpath.write_text('corrApted')
>>> os.utime(fpath, (orig_atime, orig_mtime))
>>> assert self.expired() == 'hash_diff'
>>> # Test that a wrong hash prefix causes expiration
>>> certificate = self.renew()
>>> self.hash_prefix = certificate['hash']
>>> self.expired()
>>> self.hash_prefix = ['bad', 'hashes']
>>> self.expired()
>>> # A bad hash will not allow us to renew
>>> import pytest
>>> with pytest.raises(RuntimeError):
...     self.renew()
renew(cfgstr=None, product=None)

Recertify that the product has been recomputed by writing a new certificate to disk.

Returns

certificate information

Return type

dict