ubelt.util_cache module
This module exposes the Cacher and CacheStamp classes, which provide a simple API for on-disk caching.
The Cacher class is the simplest and most direct method of caching. In fact, it only requires four lines of boilerplate, which is the smallest general and robust way that I (Jon Crall) have achieved, and I don’t think it’s possible to do better. These four lines implement the following necessary and sufficient steps for general robust on-disk caching.
Defining the cache dependencies
Checking if the cache missed
Loading the cache on a hit
Executing the process and saving the result on a miss.
The following example illustrates these four points.
Example
>>> import ubelt as ub
>>> # Define a cache name and dependencies (which is fed to `ub.hash_data`)
>>> cacher = ub.Cacher('name', depends='set-of-deps')  # boilerplate:1
>>> # Calling tryload will return your data on a hit and None on a miss
>>> data = cacher.tryload(on_error='clear')  # boilerplate:2
>>> # Check if you need to recompute your data
>>> if data is None:  # boilerplate:3
>>>     # Your code to recompute data goes here (this is not boilerplate).
>>>     data = 'mydata'
>>>     # Cache the computation result (via pickle)
>>>     cacher.save(data)  # boilerplate:4
Surprisingly, this uses just as many boilerplate lines as a decorator-style cacher, but it is much more extensible. It is possible to use Cacher in more sophisticated ways (e.g. metadata), but the simple in-line use is often easier and cleaner. The following example illustrates this:
Example
>>> import ubelt as ub
>>> @ub.Cacher('name', depends='set-of-deps')  # boilerplate:1
>>> def func():  # boilerplate:2
>>>     data = 'mydata'
>>>     return data  # boilerplate:3
>>> data = func()  # boilerplate:4

>>> cacher = ub.Cacher('name', depends='set-of-deps')  # boilerplate:1
>>> data = cacher.tryload(on_error='clear')  # boilerplate:2
>>> if data is None:  # boilerplate:3
>>>     data = 'mydata'
>>>     cacher.save(data)  # boilerplate:4
While the above two are equivalent, the second version provides a simpler traceback, explicit procedures, and makes it easier to use breakpoint debugging (because there is no closure scope).
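To make the four steps concrete without depending on ubelt, here is a minimal standard-library sketch of the same in-line pattern. It is not ubelt's implementation: the helpers `cache_path`, `tryload`, and `save` are hypothetical names, and hashing `repr(depends)` with SHA-1 is an assumption chosen only for illustration.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_path(dpath, name, depends):
    """Step 1: fold the dependencies into the cache file name."""
    # Assumption: repr() is a good-enough serialization of the dependencies
    dep_hash = hashlib.sha1(repr(depends).encode('utf8')).hexdigest()[:16]
    return Path(dpath) / f'{name}_{dep_hash}.pkl'


def tryload(fpath):
    """Steps 2 and 3: return the cached data on a hit, None on a miss."""
    if not fpath.exists():
        return None
    with open(fpath, 'rb') as file:
        return pickle.load(file)


def save(fpath, data):
    """Second half of step 4: persist the freshly computed result."""
    with open(fpath, 'wb') as file:
        pickle.dump(data, file)


dpath = tempfile.mkdtemp()
fpath = cache_path(dpath, 'name', depends={'a': 1})
data = tryload(fpath)
if data is None:
    data = 'mydata'  # stand-in for an expensive computation
    save(fpath, data)

# Loading again with the same dependencies is a hit
assert tryload(fpath) == 'mydata'
# Changing the dependencies changes the path, so the load misses
assert tryload(cache_path(dpath, 'name', depends={'a': 2})) is None
```

Because the dependencies are part of the file name, a changed dependency simply misses rather than returning stale data.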
While Cacher is used to store direct results of in-line code in a pickle format, the CacheStamp object is used to cache processes that produce on-disk side effects other than the main return value. For instance, consider the following example:
Example
>>> import ubelt as ub
>>> def compute_many_files(dpath):
...     for i in range(10):
...         fpath = '{}/file{}.txt'.format(dpath, i)
...         with open(fpath, 'w') as file:
...             file.write('foo' + str(i))
>>> dpath = ub.Path.appdir('ubelt/demo/cache').delete().ensuredir()
>>> # You must specify a directory, unlike in Cacher where it is optional
>>> self = ub.CacheStamp('name', dpath=dpath, depends={'a': 1, 'b': 2})
>>> if self.expired():
>>>     compute_many_files(dpath)
>>>     # Instead of caching the whole process, we just write a file
>>>     # that signals the process has been done.
>>>     self.renew()
>>> assert not self.expired()
The CacheStamp is lightweight in that it simply marks that a process has been completed, but the job of saving / loading the actual data is left to the developer. The expired method checks if the stamp exists, and renew writes the stamp to disk.
In ubelt version 1.1.0, several additional features were added to CacheStamp. In addition to specifying parameters via depends, it is also possible for CacheStamp to determine if an associated file has been modified. To do this, the paths of the files must be known a priori and passed to CacheStamp via the product argument. This will allow the CacheStamp to detect if the files have been modified since the renew method was called. It does this by remembering the size, modified time, and checksum of each file. If the expected hash of the product is known in advance, it is also possible to specify the expected hash_prefix of each product. In this case, renew will raise an Exception if this specified hash prefix does not match the files on disk. Lastly, it is possible to specify an expiration time via expires, after which the CacheStamp will always be marked as invalid. This is now the mechanism via which the cache in ubelt.util_download.grabdata() works.
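The size, modified time, and checksum bookkeeping described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the CacheStamp implementation: the functions `file_record`, `renew`, and `expired` are hypothetical stand-ins, and storing the record as JSON next to the products is a choice made only for this sketch.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def file_record(fpath):
    """Collect the size, mtime, and sha256 checksum of one product file."""
    stat = os.stat(fpath)
    digest = hashlib.sha256(Path(fpath).read_bytes()).hexdigest()
    return {'size': stat.st_size, 'mtime': stat.st_mtime, 'hash': digest}


def renew(stamp_fpath, products):
    """Write a stamp describing the current state of every product."""
    certificate = {str(p): file_record(p) for p in products}
    Path(stamp_fpath).write_text(json.dumps(certificate))
    return certificate


def expired(stamp_fpath, products):
    """Return True if the stamp is missing or any product has changed."""
    if not os.path.exists(stamp_fpath):
        return True
    certificate = json.loads(Path(stamp_fpath).read_text())
    for p in products:
        if not os.path.exists(p):
            return True
        if certificate.get(str(p)) != file_record(p):
            return True
    return False


dpath = Path(tempfile.mkdtemp())
products = [dpath / 'file0.txt', dpath / 'file1.txt']
stamp = dpath / 'name.stamp.json'

assert expired(stamp, products)        # nothing has been computed yet
for p in products:
    p.write_text(p.name)
renew(stamp, products)
assert not expired(stamp, products)    # stamp matches the files on disk

products[0].write_text('corrupted')    # the checksum no longer matches
assert expired(stamp, products)
```

Note that `'corrupted'` happens to have the same length as `'file0.txt'`, so in this run it is the checksum (and possibly the mtime) that exposes the change, not the size.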
Example
>>> import ubelt as ub
>>> dpath = ub.Path.appdir('ubelt/demo/cache').delete().ensuredir()
>>> params = {'a': 1, 'b': 2}
>>> expected_fpaths = [dpath / 'file{}.txt'.format(i) for i in range(2)]
>>> hash_prefix = ['a7a8a91659601590e17191301dc1',
...                '55ae75d991c770d8f3ef07cbfde1']
>>> self = ub.CacheStamp('name', dpath=dpath, depends=params,
>>>                      hash_prefix=hash_prefix, hasher='sha256',
>>>                      product=expected_fpaths, expires='2101-01-01T000000Z')
>>> if self.expired():
>>>     for fpath in expected_fpaths:
...         fpath.write_text(fpath.name)
>>>     self.renew()
>>> # modifying or removing the file will cause the stamp to expire
>>> expected_fpaths[0].write_text('corrupted')
>>> assert self.expired()
- RelatedWork:
- class ubelt.util_cache.Cacher(fname, depends=None, dpath=None, appname='ubelt', ext='.pkl', meta=None, verbose=None, enabled=True, log=None, hasher='sha1', protocol=-1, cfgstr=None, backend='auto')
Bases: object
Saves data to disk and reloads it based on specified dependencies.
Cacher uses pickle to save/load data to/from disk. Dependencies of the cached process can be specified, which ensures the cached data is recomputed if the dependencies change. If the location of the cache is not specified, it will default to the system user’s cache directory.
- Related:
[JobLibMemory] https://joblib.readthedocs.io/en/stable/memory.html
Example
>>> import ubelt as ub
>>> depends = 'repr-of-params-that-uniquely-determine-the-process'
>>> # Create a cacher and try loading the data
>>> cacher = ub.Cacher('demo_process', depends, verbose=4)
>>> cacher.clear()
>>> print(f'cacher.fpath={cacher.fpath}')
>>> data = cacher.tryload()
>>> if data is None:
>>>     # Put expensive functions in if block when cacher misses
>>>     myvar1 = 'result of expensive process'
>>>     myvar2 = 'another result'
>>>     # Tell the cacher to write at the end of the if block
>>>     # It is idiomatic to put results in an object named data
>>>     data = myvar1, myvar2
>>>     cacher.save(data)
>>> # Last part of the Cacher pattern is to unpack the data object
>>> myvar1, myvar2 = data
>>> #
>>> # If we know the data exists, we can also simply call load
>>> data = cacher.tryload()
Example
>>> # The previous example can be shortened if only a single value is cached
>>> from ubelt.util_cache import Cacher
>>> depends = 'repr-of-params-that-uniquely-determine-the-process'
>>> # Create a cacher and try loading the data
>>> cacher = Cacher('demo_process', depends)
>>> myvar = cacher.tryload()
>>> if myvar is None:
>>>     myvar = ('result of expensive process', 'another result')
>>>     cacher.save(myvar)
>>> assert cacher.exists(), 'should now exist'
- Parameters:
fname (str) – A file name. This is the prefix that will be used by the cache. It will always be used as-is.
depends (str | List[str] | None) – Indicate dependencies of this cache. If the dependencies change, then the cache is recomputed. New in version 0.8.9, replaces cfgstr.
dpath (str | PathLike | None) – Specifies where to save the cache. If unspecified, Cacher defaults to an application cache dir as given by appname. See ub.get_app_cache_dir() for more details.
appname (str) – Application name. Specifies a folder in the application cache directory where to cache the data if dpath is not specified. Defaults to ‘ubelt’.
ext (str) – File extension for the cache format. Can be '.pkl' or '.json'. Defaults to '.pkl'.
meta (object | None) – Metadata that is also saved with the cfgstr. This can be useful to indicate how the cfgstr was constructed. Note: this is a candidate for deprecation.
verbose (int) – Level of verbosity. Can be 1, 2 or 3. Defaults to 1.
enabled (bool) – If set to False, then the load and save methods will do nothing. Defaults to True.
log (Callable[[str], Any]) – Overloads the print function. Useful for sending output to loggers (e.g. logging.info, tqdm.tqdm.write, …)
hasher (str) – Type of hashing algorithm to use if cfgstr needs to be condensed to less than 49 characters. Defaults to sha1.
protocol (int) – Protocol version used by pickle. Defaults to -1, which is the latest protocol.
backend (str) – Set to either 'pickle' or 'json' to force backend. Defaults to auto which chooses one based on the extension.
cfgstr (str | None) – Deprecated in favor of depends.
- VERBOSE = 1
- FORCE_DISABLE = False
- get_fpath(cfgstr=None)
Reports the filepath that the cacher will use.
It will attempt to use ‘{fname}_{cfgstr}{ext}’ unless that is too long, in which case cfgstr is hashed.
- Parameters:
cfgstr (str | None) – overrides the instance-level cfgstr
- Returns:
str | PathLike
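The naming rule can be sketched as follows. This is a hedged illustration rather than the code ubelt runs: `condensed_fname` is a hypothetical helper, and the 49-character threshold is taken from the description of the hasher parameter, with the exact comparison being an assumption.

```python
import hashlib


def condensed_fname(fname, cfgstr, ext='.pkl', max_len=49, hasher='sha1'):
    """Build '{fname}_{cfgstr}{ext}', hashing cfgstr when it is too long."""
    # Assumption: "too long" means max_len characters or more
    if len(cfgstr) >= max_len:
        digest = hashlib.new(hasher, cfgstr.encode('utf8')).hexdigest()
        cfgstr = digest[:max_len - 1]
    return f'{fname}_{cfgstr}{ext}'


# A short cfgstr is used verbatim
short_name = condensed_fname('test_cacher2', 'cfg1')
# A long cfgstr (128 characters here) is replaced by its hex digest
long_name = condensed_fname('test_cacher3', 'cfg1' * 32)
print(short_name)
print(long_name)
```

Hashing keeps file names within filesystem limits while still giving each distinct cfgstr its own cache file.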
Example
>>> # xdoctest: +REQUIRES(module:pytest)
>>> from ubelt.util_cache import Cacher
>>> import pytest
>>> #with pytest.warns(UserWarning):
>>> if 1:  # we no longer warn here
>>>     cacher = Cacher('test_cacher1')
>>>     cacher.get_fpath()
>>> self = Cacher('test_cacher2', depends='cfg1')
>>> self.get_fpath()
>>> self = Cacher('test_cacher3', depends='cfg1' * 32)
>>> self.get_fpath()
- exists(cfgstr=None)
Check to see if the cache exists
- Parameters:
cfgstr (str | None) – overrides the instance-level cfgstr
- Returns:
bool
- existing_versions()
Returns data with different cfgstr values that were previously computed with this cacher.
- Yields:
str – paths to cached files corresponding to this cacher
Example
>>> # Ensure that some data exists
>>> import ubelt as ub
>>> dpath = ub.Path.appdir(
>>>     'ubelt/tests/util_cache',
>>>     'test-existing-versions').delete().ensuredir()
>>> cacher = ub.Cacher('versioned_data_v2', depends='1', dpath=dpath)
>>> cacher.ensure(lambda: 'data1')
>>> known_fpaths = set()
>>> known_fpaths.add(cacher.get_fpath())
>>> cacher = ub.Cacher('versioned_data_v2', depends='2', dpath=dpath)
>>> cacher.ensure(lambda: 'data2')
>>> known_fpaths.add(cacher.get_fpath())
>>> # List previously computed configs for this type
>>> from os.path import basename
>>> cacher = ub.Cacher('versioned_data_v2', depends='2', dpath=dpath)
>>> exist_fpaths = set(cacher.existing_versions())
>>> exist_fnames = list(map(basename, exist_fpaths))
>>> print('exist_fnames = {!r}'.format(exist_fnames))
>>> print('exist_fpaths = {!r}'.format(exist_fpaths))
>>> print('known_fpaths={!r}'.format(known_fpaths))
>>> assert exist_fpaths.issubset(known_fpaths)
- clear(cfgstr=None)
Removes the saved cache and metadata from disk
- Parameters:
cfgstr (str | None) – overrides the instance-level cfgstr
- tryload(cfgstr=None, on_error='raise')
Like load, but returns None if the load fails due to a cache miss.
- Parameters:
cfgstr (str | None) – overrides the instance-level cfgstr
on_error (str) – How to handle non-IO errors. Either ‘raise’, which re-raises the exception, or ‘clear’ which deletes the cache and returns None. Defaults to ‘raise’.
- Returns:
the cached data if it exists, otherwise returns None
- Return type:
None | object
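The difference between a cache miss and a load error can be illustrated with a small standard-library sketch. It mirrors the documented on_error semantics but is not ubelt's code; the free function `tryload` below is a hypothetical stand-in that operates on a plain pickle file.

```python
import pickle
import tempfile
from pathlib import Path


def tryload(fpath, on_error='raise'):
    """Return cached data, or None on a miss; handle unreadable caches."""
    fpath = Path(fpath)
    if not fpath.exists():
        return None  # plain cache miss
    try:
        with open(fpath, 'rb') as file:
            return pickle.load(file)
    except Exception:
        if on_error == 'raise':
            raise
        elif on_error == 'clear':
            fpath.unlink()  # drop the bad cache so it gets recomputed
            return None
        raise KeyError(f'unknown on_error={on_error!r}')


dpath = Path(tempfile.mkdtemp())
fpath = dpath / 'cache.pkl'
assert tryload(fpath) is None              # miss: nothing on disk

fpath.write_bytes(pickle.dumps('mydata'))
assert tryload(fpath) == 'mydata'          # hit

fpath.write_bytes(b'not a pickle')         # simulate a corrupted cache
assert tryload(fpath, on_error='clear') is None
assert not fpath.exists()                  # the corrupted file was removed
```

Using on_error='clear' trades a traceback for a silent recompute, which is usually what an in-line cache wants when a file was truncated or written by an incompatible version.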
- load(cfgstr=None)
Load the data cached and raise an error if something goes wrong.
- Parameters:
cfgstr (str | None) – overrides the instance-level cfgstr
- Returns:
the cached data
- Return type:
- Raises:
IOError – if the data is unable to be loaded. This could be due to a cache miss or because the cache is disabled.
Example
>>> from ubelt.util_cache import *  # NOQA
>>> # Setting the cacher as enabled=False turns it off
>>> cacher = Cacher('test_disabled_load', '', enabled=True,
>>>                 appname='ubelt/tests/util_cache')
>>> cacher.save('data')
>>> assert cacher.load() == 'data'
>>> cacher.enabled = False
>>> assert cacher.tryload() is None
- save(data, cfgstr=None)
Writes data to path specified by self.fpath.
Metadata containing information about the cache will also be appended to an adjacent file with the .meta suffix.
- Parameters:
data (object) – arbitrary pickleable object to be cached
cfgstr (str | None) – overrides the instance-level cfgstr
Example
>>> from ubelt.util_cache import *  # NOQA
>>> # Normal functioning
>>> depends = 'long-cfg' * 32
>>> cacher = Cacher('test_enabled_save', depends=depends,
>>>                 appname='ubelt/tests/util_cache')
>>> cacher.save('data')
>>> assert exists(cacher.get_fpath()), 'should be enabled'
>>> assert exists(cacher.get_fpath() + '.meta'), 'missing metadata'
>>> # Setting the cacher as enabled=False turns it off
>>> cacher2 = Cacher('test_disabled_save', 'params', enabled=False,
>>>                  appname='ubelt/tests/util_cache')
>>> cacher2.save('data')
>>> assert not exists(cacher2.get_fpath()), 'should be disabled'
- _backend_load(data_fpath)
Example
>>> import ubelt as ub
>>> cacher = ub.Cacher('test_other_backend', depends=['a'], ext='.json')
>>> cacher.save(['data'])
>>> cacher.tryload()

>>> import ubelt as ub
>>> cacher = ub.Cacher('test_other_backend2', depends=['a'], ext='.yaml', backend='json')
>>> cacher.save({'data': [1, 2, 3]})
>>> cacher.tryload()

>>> import pytest
>>> with pytest.raises(ValueError):
>>>     ub.Cacher('test_other_backend2', depends=['a'], ext='.yaml', backend='does-not-exist')
>>> cacher = ub.Cacher('test_other_backend2', depends=['a'], ext='.really-a-pickle', backend='auto')
>>> assert cacher.backend == 'pickle', 'should be default'
- ensure(func, *args, **kwargs)
Wraps around a function. A cfgstr must be stored in the base cacher.
- Parameters:
func (Callable) – function that will compute data on cache miss
*args – passed to func
**kwargs – passed to func
Example
>>> from ubelt.util_cache import *  # NOQA
>>> def func():
>>>     return 'expensive result'
>>> fname = 'test_cacher_ensure'
>>> depends = 'func params'
>>> cacher = Cacher(fname, depends=depends)
>>> cacher.clear()
>>> data1 = cacher.ensure(func)
>>> data2 = cacher.ensure(func)
>>> assert data1 == 'expensive result'
>>> assert data1 == data2
>>> cacher.clear()
- class ubelt.util_cache.CacheStamp(fname, dpath, cfgstr=None, product=None, hasher='sha1', verbose=None, enabled=True, depends=None, meta=None, hash_prefix=None, expires=None, ext='.pkl')
Bases: object
Quickly determine if a file-producing computation has been done.
Check if the computation needs to be redone by calling expired. If the stamp is not expired, the user can expect that the results exist and could be loaded. If the stamp is expired, the computation should be redone. After the result is updated, the user calls renew, which writes a “stamp” file to disk that marks that the procedure has been done.
There are several ways to control how a stamp expires. At a bare minimum, removing the stamp file will force expiration. However, in this circumstance CacheStamp only knows that something has been done, but it doesn’t have any information about what was done, so in general this is not sufficient.
To achieve more robust expiration behavior, the user should specify the product argument, which is a list of file paths that are expected to exist whenever the stamp is renewed. When this is specified the CacheStamp will expire if any of these products are deleted, their size changes, their modified timestamp changes, or their hash (i.e. checksum) changes. Note that by setting hasher=None, running and verifying checksums can be disabled.
If the user knows what the hash of the file should be, this can be specified to prevent renewal of the stamp unless these match the files on disk. This can be useful for security purposes.
The stamp can also be set to expire at a specified time or after a specified duration using the expires argument.
Notes
The size, mtime, and hash mechanism is similar to how Makefile and redo caches work.
- Variables:
cacher (Cacher) – underlying cacher object
Example
>>> import ubelt as ub
>>> # Stamp the computation of expensive-to-compute.txt
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp')
>>> dpath.delete().ensuredir()
>>> product = dpath / 'expensive-to-compute.txt'
>>> self = ub.CacheStamp('somedata', depends='someconfig', dpath=dpath,
>>>                      product=product, hasher='sha256')
>>> self.clear()
>>> print(f'self.fpath={self.fpath}')
>>> if self.expired():
>>>     product.write_text('very expensive')
>>>     self.renew()
>>> assert not self.expired()
>>> # corrupting the output will cause the stamp to expire
>>> product.write_text('very corrupted')
>>> assert self.expired()
- Parameters:
fname (str) – Name of the stamp file
dpath (str | PathLike | None) – Where to store the cached stamp file
product (str | PathLike | Sequence[str | PathLike] | None) – Path or paths that we expect the computation to produce. If specified the hash of the paths are stored.
hasher (str) – The type of hasher used to compute the file hash of product. If None, then we assume the file has not been corrupted or changed if the mtime and size are the same. Defaults to sha1.
verbose (bool | None) – Passed to internal ubelt.Cacher object. Defaults to None.
enabled (bool) – if False, expired always returns True. Defaults to True.
depends (str | List[str] | None) – Indicate dependencies of this cache. If the dependencies change, then the cache is recomputed. New to CacheStamp in version 0.9.2.
meta (object | None) – Metadata that is also saved as a sidecar file. New to CacheStamp in version 0.9.2. Note: this is a candidate for deprecation.
expires (str | int | datetime.datetime | datetime.timedelta | None) – If specified, sets an expiration date for the certificate. This can be an absolute datetime or a timedelta offset. If specified as an int, this is interpreted as a time delta in seconds. If specified as a str, this is interpreted as an absolute timestamp. Time delta offsets are coerced to absolute times at “renew” time.
hash_prefix (None | str | List[str]) – If specified, we verify that these match the hash(s) of the product(s) in the stamp certificate.
ext (str) – File extension for the cache format. Can be '.pkl' or '.json'. Defaults to '.pkl'.
cfgstr (str | None) – DEPRECATED.
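The coercion rules for expires (int as seconds, timedelta as an offset, str or datetime as an absolute time) can be sketched like this. It is an illustrative approximation only: `coerce_expires` is a hypothetical helper, and it assumes ISO 8601 strings that `datetime.fromisoformat` accepts, which is narrower than the timestamp strings shown elsewhere in these docs.

```python
import datetime as datetime_mod


def coerce_expires(expires, now=None):
    """Return an absolute datetime for expires, or None if it never expires."""
    if expires is None:
        return None
    if isinstance(expires, datetime_mod.datetime):
        return expires  # already an absolute time
    if isinstance(expires, str):
        # Assumption: ISO 8601 input that fromisoformat understands
        return datetime_mod.datetime.fromisoformat(expires)
    if now is None:
        now = datetime_mod.datetime.now(datetime_mod.timezone.utc)
    if isinstance(expires, int):
        # Integers are interpreted as an offset in seconds
        expires = datetime_mod.timedelta(seconds=expires)
    if isinstance(expires, datetime_mod.timedelta):
        return now + expires  # offsets become absolute relative to now
    raise TypeError(f'unsupported expires type: {type(expires)!r}')


now = datetime_mod.datetime(2020, 1, 1, tzinfo=datetime_mod.timezone.utc)
in_ten_seconds = coerce_expires(10, now)
far_future = coerce_expires('2101-01-01T00:00:00+00:00')
print(in_ten_seconds.isoformat())
print(far_future.isoformat())
```

Resolving offsets against the time of renewal means a stamp renewed with expires=10 is valid for ten seconds after each renew call, not ten seconds after the object was constructed.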
- property fpath
- expired(cfgstr=None, product=None)
Check to see if a previously existing stamp is still valid, if the expected result of that computation still exists, and if all other expiration criteria are met.
- Parameters:
cfgstr (Any) – DEPRECATED
product (Any) – DEPRECATED
- Returns:
A truthy value if the stamp is invalid, expired, or does not exist. When the stamp is expired, the reason for expiration is returned as a string. If the stamp is still valid, False is returned.
- Return type:
Example
>>> import ubelt as ub
>>> import time
>>> import os
>>> # Stamp the computation of expensive-to-compute.txt
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp-expired')
>>> dpath.delete().ensuredir()
>>> products = [
>>>     dpath / 'product1.txt',
>>>     dpath / 'product2.txt',
>>> ]
>>> self = ub.CacheStamp('myname', depends='myconfig', dpath=dpath,
>>>                      product=products, hasher='sha256',
>>>                      expires=0)
>>> if self.expired():
>>>     for fpath in products:
>>>         fpath.write_text(fpath.name)
>>>     self.renew()
>>> fpath = products[0]
>>> # Because we set the expiration delta to 0, we should already be expired
>>> assert self.expired() == 'expired_cert'
>>> # Disable the expiration date, renew and we should be ok
>>> self.expires = None
>>> self.renew()
>>> assert not self.expired()
>>> # Modify the mtime to cause expiration
>>> orig_atime = fpath.stat().st_atime
>>> orig_mtime = fpath.stat().st_mtime
>>> os.utime(fpath, (orig_atime, orig_mtime + 200))
>>> assert self.expired() == 'mtime_diff'
>>> self.renew()
>>> assert not self.expired()
>>> # rewriting the file will cause the size constraint to fail
>>> # even if we hack the mtime to be the same
>>> orig_atime = fpath.stat().st_atime
>>> orig_mtime = fpath.stat().st_mtime
>>> fpath.write_text('corrupted')
>>> os.utime(fpath, (orig_atime, orig_mtime))
>>> assert self.expired() == 'size_diff'
>>> self.renew()
>>> assert not self.expired()
>>> # Force a situation where the hash is the only thing
>>> # that saves us, write a different file with the same
>>> # size and mtime.
>>> orig_atime = fpath.stat().st_atime
>>> orig_mtime = fpath.stat().st_mtime
>>> fpath.write_text('corrApted')
>>> os.utime(fpath, (orig_atime, orig_mtime))
>>> assert self.expired() == 'hash_diff'
>>> # Test that a wrong hash prefix causes expiration
>>> certificate = self.renew()
>>> self.hash_prefix = certificate['hash']
>>> self.expired()
>>> self.hash_prefix = ['bad', 'hashes']
>>> self.expired()
>>> # A bad hash will not allow us to renew
>>> import pytest
>>> with pytest.raises(RuntimeError):
...     self.renew()
- _expires(now=None)
- Returns:
the absolute local time when the stamp expires
- Return type:
Example
>>> import ubelt as ub
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp-expires')
>>> self = ub.CacheStamp('myname', depends='myconfig', dpath=dpath)
>>> # Test str input
>>> self.expires = '2020-01-01T000000Z'
>>> assert self._expires().replace(tzinfo=None).isoformat() == '2020-01-01T00:00:00'
>>> # Test datetime input
>>> dt = ub.timeparse(ub.timestamp())
>>> self.expires = dt
>>> assert self._expires() == dt
>>> # Test None input
>>> self.expires = None
>>> assert self._expires() is None
>>> # Test int input
>>> self.expires = 0
>>> assert self._expires(dt) == dt
>>> self.expires = 10
>>> assert self._expires(dt) > dt
>>> self.expires = -10
>>> assert self._expires(dt) < dt
>>> # Test timedelta input
>>> import datetime as datetime_mod
>>> self.expires = datetime_mod.timedelta(seconds=-10)
>>> assert self._expires(dt) == dt + self.expires
- _new_certificate(cfgstr=None, product=None)
- Returns:
certificate information
- Return type:
Example
>>> import ubelt as ub
>>> # Stamp the computation of expensive-to-compute.txt
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp-cert').ensuredir()
>>> product = dpath / 'product1.txt'
>>> product.write_text('hi')
>>> self = ub.CacheStamp('myname', depends='myconfig', dpath=dpath,
>>>                      product=product)
>>> cert = self._new_certificate()
>>> assert cert['expires'] is None
>>> self.expires = '2020-01-01T000000'
>>> self.renew()
>>> cert = self._new_certificate()
>>> assert cert['expires'] is not None
- renew(cfgstr=None, product=None)
Recertify that the product has been recomputed by writing a new certificate to disk.
- Parameters:
cfgstr (None | str) – deprecated, do not use.
product (None | str | List) – deprecated, do not use.
- Returns:
certificate information if enabled otherwise None.
- Return type:
None | dict
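When hash_prefix is given, renewal is refused unless each product's hash starts with the expected prefix. The check can be sketched with hashlib as below; `check_hash_prefixes` is a hypothetical helper written for illustration, and the exact error message and pairing of prefixes to products in ubelt may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def check_hash_prefixes(products, hash_prefixes, hasher='sha256'):
    """Raise RuntimeError unless each file hash starts with its prefix."""
    for fpath, prefix in zip(products, hash_prefixes):
        digest = hashlib.new(hasher, Path(fpath).read_bytes()).hexdigest()
        if not digest.startswith(prefix):
            raise RuntimeError(
                f'hash of {fpath} starts with {digest[:8]!r}, '
                f'expected prefix {prefix!r}')


dpath = Path(tempfile.mkdtemp())
product = dpath / 'product1.txt'
product.write_text('product1.txt')
good_prefix = hashlib.sha256(b'product1.txt').hexdigest()[:16]

check_hash_prefixes([product], [good_prefix])   # matches, so no error

try:
    # 'not-hex' can never be the prefix of a hexadecimal digest
    check_hash_prefixes([product], ['not-hex'])
except RuntimeError:
    mismatch_detected = True
else:
    mismatch_detected = False
assert mismatch_detected
```

Comparing only a prefix lets a caller pin a download with a short, human-checkable fragment of the full digest.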
Example
>>> # Test that renew does nothing when the cacher is disabled
>>> import ubelt as ub
>>> dpath = ub.Path.appdir('ubelt/tests/cache-stamp-renew').ensuredir()
>>> self = ub.CacheStamp('foo', dpath=dpath, enabled=False)
>>> assert self.renew() is None
- ubelt.util_cache._byte_str(num, unit='auto', precision=2)
Automatically chooses relevant unit (KB, MB, or GB) for displaying some number of bytes.
- Parameters:
num (int) – number of bytes
unit (str) – which unit to use, can be auto, B, KB, MB, GB, or TB
References
[WikiOrdersOfMag]
- Returns:
string representing the number of bytes with appropriate units
- Return type:
Example
>>> from ubelt.util_cache import _byte_str
>>> import ubelt as ub
>>> num_list = [1, 100, 1024, 1048576, 1073741824, 1099511627776]
>>> result = ub.urepr(list(map(_byte_str, num_list)), nl=0)
>>> print(result)
['0.00KB', '0.10KB', '1.00KB', '1.00MB', '1.00GB', '1.00TB']
>>> _byte_str(10, unit='B')
10.00B
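The unit selection implied by the outputs above can be reproduced with a short standalone sketch. `format_bytes` is a hypothetical re-implementation inferred from the documented results (1024-based units, KB as the smallest automatic unit), not a copy of `_byte_str`.

```python
def format_bytes(num, unit='auto', precision=2):
    """Format a byte count using 1024-based units."""
    scales = {
        'B': 1,
        'KB': 2 ** 10,
        'MB': 2 ** 20,
        'GB': 2 ** 30,
        'TB': 2 ** 40,
    }
    if unit == 'auto':
        # Pick the largest unit that keeps the value at or above 1,
        # but never go below KB (so 1 byte is shown as 0.00KB).
        if abs(num) < 2 ** 20:
            unit = 'KB'
        elif abs(num) < 2 ** 30:
            unit = 'MB'
        elif abs(num) < 2 ** 40:
            unit = 'GB'
        else:
            unit = 'TB'
    if unit not in scales:
        raise ValueError(f'unknown unit {unit!r}')
    return f'{num / scales[unit]:.{precision}f}{unit}'


num_list = [1, 100, 1024, 1048576, 1073741824, 1099511627776]
formatted = [format_bytes(n) for n in num_list]
print(formatted)
print(format_bytes(10, unit='B'))
```

Passing an explicit unit bypasses the automatic choice, which is how the `unit='B'` call in the example yields a value below one kilobyte without rounding it to zero.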