ubelt.util_hash module
Wrappers around hashlib functions to generate hash signatures for common data.
The hashes are deterministic across Python versions and operating systems. This is verified by CI testing on Windows and Linux, with Python 2.7, 3.4, and greater, on both 32-bit and 64-bit builds.
Use Case #1: You have data that you want to hash. If the data consists of
standard Python scalars or ordered sequences (e.g. tuple, list, odict, oset,
int, str, etc.), then the solution is hash_data().
Use Case #2: You have a file you want to hash, but your system doesn’t have a
sha1sum executable (or you don’t want to use Popen). The solution is
hash_file().
The ubelt.util_hash.hash_data() function recursively hashes most builtin
Python data structures.
The ubelt.util_hash.hash_file() function hashes data on disk. Both of
these functions have options for different hashers and alphabets.
Example
>>> import ubelt as ub
>>> data = ub.odict(sorted({
>>>     'param1': True,
>>>     'param2': 0,
>>>     'param3': [None],
>>>     'param4': ('str', 4.2),
>>> }.items()))
>>> # hash_data can hash any ordered builtin object
>>> ub.hash_data(data, hasher='sha256')
0b101481e4b894ddf6de57...
Example
>>> import ubelt as ub
>>> from os.path import join
>>> fpath = ub.touch(join(ub.ensure_app_cache_dir('ubelt'), 'empty_file'))
>>> ub.hash_file(fpath, hasher='sha1')
da39a3ee5e6b4b0d3255bfef95601890afd80709
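The recursion that hash_data() performs can be pictured with a stdlib-only sketch. The helper name sketch_hash_data and its byte encoding are assumptions made for illustration. This is not ubelt's implementation and it will not reproduce ubelt's hash values. It only shows why ordered containers hash deterministically, and what mixing type names into the encoding changes.

```python
import hashlib


def _sketch_update(digest, data, types):
    """Feed ``data`` into ``digest``, recursing into lists and tuples."""
    if types:
        # Optionally mix the type name in, so [1, 2] and (1, 2) differ.
        digest.update(type(data).__name__.encode('utf8') + b':')
    if isinstance(data, (list, tuple)):
        digest.update(b'[')
        for item in data:
            _sketch_update(digest, item, types)
            digest.update(b',')
        digest.update(b']')
    elif data is None or isinstance(data, (bool, int, float, str, bytes)):
        # Scalars are encoded via repr, which keeps 1 and '1' distinct.
        digest.update(repr(data).encode('utf8'))
    else:
        raise TypeError('no encoding for {!r}'.format(type(data)))


def sketch_hash_data(data, types=False, hasher='sha512'):
    """Hypothetical stand-in for hash_data; returns a hex digest."""
    digest = hashlib.new(hasher)
    _sketch_update(digest, data, types)
    return digest.hexdigest()


nested_list = [1, 2, (3, '4')]
nested_tuple = (1, 2, [3, '4'])
# Without type names both containers produce the same byte stream.
same_without_types = (
    sketch_hash_data(nested_list) == sketch_hash_data(nested_tuple))
# With type names the list/tuple distinction changes the hash.
same_with_types = (
    sketch_hash_data(nested_list, types=True)
    == sketch_hash_data(nested_tuple, types=True))
print(same_without_types, same_with_types)  # prints: True False
```

Unordered containers such as set are left out of the sketch on purpose: they would need a canonical ordering before hashing, which is the kind of detail the real extension registry handles.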
Note
The exact hashes generated for data objects and files may change in the
future. When this happens the HASH_VERSION attribute will be incremented.
Note
[util_hash.Note.1] Before 0.10.2, the protected function
_hashable_sequence defaulted to types=True. It is still set to True here for
backwards compatibility. This means that extensions using the
_hashable_sequence helper will always include types in their hashable
encoding, regardless of the argument setting. We may change this in the
future to be more consistent. This is a minor detail unless you are
getting into the weeds of how we coerce technically non-hashable sequences
into a hashable encoding.
- ubelt.util_hash.hash_data(data, hasher=NoParam, base=NoParam, types=False, convert=False, extensions=None)
Get a unique hash depending on the state of the data.
- Parameters
data (object) – Any sort of loosely organized data
hasher (str | Hasher | NoParamType) – string code or a hash algorithm from hashlib. Valid hashing algorithms are defined by hashlib.algorithms_guaranteed (e.g. ‘sha1’, ‘sha512’, ‘md5’) as well as ‘xxh32’ and ‘xxh64’ if xxhash is installed. Defaults to ‘sha512’.
base (List[str] | str | NoParamType) – list of symbols or shorthand key. Valid keys are ‘abc’, ‘hex’, and ‘dec’. Defaults to ‘hex’.
types (bool) – If True, data types are included in the hash; otherwise only the raw data is hashed. Defaults to False.
convert (bool) – if True, try to convert the data to JSON and hash the JSON instead. This can improve runtime in some instances; however, the hash may differ from the case where convert=False. Defaults to False.
extensions (HashableExtensions) – a custom HashableExtensions instance that can overwrite or define how different types of objects are hashed.
Note
The types allowed are specified by the HashableExtensions object. By default ubelt will register:
OrderedDict, uuid.UUID, np.random.RandomState, np.int64, np.int32, np.int16, np.int8, np.uint64, np.uint32, np.uint16, np.uint8, np.float16, np.float32, np.float64, np.float128, np.ndarray, bytes, str, int, float, long (in python2), list, tuple, set, and dict
- Returns
text representing the hashed data
- Return type
str
Note
The alphabet26 base is a pretty nice base, I recommend it. However we default to
base='hex' because it is standard. You can try the alphabet26 base by setting
base='abc'.
Example
>>> import ubelt as ub
>>> print(ub.hash_data([1, 2, (3, '4')], convert=False))
60b758587f599663931057e6ebdf185a...
>>> print(ub.hash_data([1, 2, (3, '4')], base='abc', hasher='sha512')[:32])
hsrgqvfiuxvvhcdnypivhhthmrolkzej
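To make the base option more concrete, here is a minimal sketch of re-encoding a digest in a different alphabet by treating it as one large integer. The helper sketch_encode_digest is hypothetical, and ubelt's own encoding may differ in details such as leading-zero handling, so the output should not be expected to match hash_data(..., base='abc').

```python
import hashlib
import string


def sketch_encode_digest(digest_bytes, alphabet):
    """Hypothetical helper: write a digest using the symbols in ``alphabet``."""
    number = int.from_bytes(digest_bytes, 'big')
    if number == 0:
        return alphabet[0]
    radix = len(alphabet)
    symbols = []
    while number > 0:
        # Peel off the least significant symbol first.
        number, remainder = divmod(number, radix)
        symbols.append(alphabet[remainder])
    return ''.join(reversed(symbols))


digest = hashlib.sha512(b'foobar').digest()
as_hex = sketch_encode_digest(digest, '0123456789abcdef')
as_abc = sketch_encode_digest(digest, string.ascii_lowercase)
print(len(as_hex), len(as_abc))
```

A 26-symbol alphabet carries more bits per character than hex, so the same digest is written with fewer characters, and the result contains only lowercase letters.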
- ubelt.util_hash.hash_file(fpath, blocksize=1048576, stride=1, maxbytes=None, hasher=NoParam, base=NoParam)
Hashes the data in a file on disk.
The results of this function agree with the standard UNIX commands (e.g. sha1sum, sha512sum, md5sum, etc…)
- Parameters
fpath (PathLike) – location of the file to be hashed.
blocksize (int) – Amount of data to read and hash at a time. There is a trade off and the optimal number will depend on specific hardware. This number was chosen to be optimal on a developer system. See “dev/bench_hash_file” for methodology to choose this number for your use case. Defaults to 2 ** 20.
stride (int) – strides > 1 skip data when hashing. This is faster but less accurate, and it makes the hash dependent on blocksize. Defaults to 1.
maxbytes (int | None) – if specified, only hash the leading maxbytes of data in the file.
hasher (str | Hasher | NoParamType) – string code or a hash algorithm from hashlib. Valid hashing algorithms are defined by hashlib.algorithms_guaranteed (e.g. ‘sha1’, ‘sha512’, ‘md5’) as well as ‘xxh32’ and ‘xxh64’ if xxhash is installed. Defaults to ‘sha512’. TODO: add logic such that you can update an existing hasher.
base (List[str] | str | NoParamType) – list of symbols or shorthand key. Valid keys are ‘abc’, ‘hex’, and ‘dec’. Defaults to ‘hex’.
Note
For better hashes keep stride = 1. For faster hashes set stride > 1. Blocksize matters when stride > 1.
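The interaction between blocksize, stride, and maxbytes can be sketched with hashlib alone. The function sketch_hash_file below is a hypothetical illustration, not ubelt's implementation. In particular, it assumes that maxbytes counts only the bytes actually hashed and that skipped blocks are not counted.

```python
import hashlib
import os
import tempfile


def sketch_hash_file(fpath, blocksize=2 ** 20, stride=1, maxbytes=None,
                     hasher='sha1'):
    """Hypothetical helper: hash a file block by block with hashlib."""
    digest = hashlib.new(hasher)
    remaining = maxbytes
    with open(fpath, 'rb') as file:
        while True:
            # Never read past the maxbytes budget (assumed semantics).
            nread = blocksize if remaining is None else min(blocksize, remaining)
            if nread <= 0:
                break
            buf = file.read(nread)
            if not buf:
                break
            digest.update(buf)
            if remaining is not None:
                remaining -= len(buf)
            if stride > 1:
                # Skip (stride - 1) blocks; those bytes are never hashed.
                file.seek(blocksize * (stride - 1), os.SEEK_CUR)
    return digest.hexdigest()


data = b'abcdefghijklmnop'
with tempfile.TemporaryDirectory() as dpath:
    fpath = os.path.join(dpath, 'tmp.txt')
    with open(fpath, 'wb') as file:
        file.write(data)
    full_small = sketch_hash_file(fpath, blocksize=3)
    full_large = sketch_hash_file(fpath, blocksize=32)
    head = sketch_hash_file(fpath, blocksize=3, maxbytes=11)
    strided_small = sketch_hash_file(fpath, blocksize=3, stride=2)
    strided_large = sketch_hash_file(fpath, blocksize=32, stride=2)

# With stride=1 the blocksize only affects speed, not the result.
print(full_small == full_large)        # prints: True
# With stride=2 different blocksizes hash different subsets of the file.
print(strided_small == strided_large)  # prints: False
```

With blocksize=3 and stride=2 the sketch hashes only the bytes abc, ghi, and mno, while blocksize=32 reads the whole 16-byte file in the first block, which is why the two strided digests disagree.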
References
- SO_3431825
http://stackoverflow.com/questions/3431825/md5-checksum-of-a-file
- SO_5001893
http://stackoverflow.com/questions/5001893/when-to-use-sha-1-vs-sha-2
Example
>>> import ubelt as ub
>>> from os.path import join
>>> dpath = ub.Path.appdir('ubelt/tests/test-hash').ensuredir()
>>> fpath = dpath / 'tmp1.txt'
>>> fpath.write_text('foobar')
>>> print(ub.hash_file(fpath, hasher='sha1', base='hex'))
8843d7f92416211de9ebb963ff4ce28125932878
Example
>>> import ubelt as ub
>>> dpath = ub.Path.appdir('ubelt/tests/test-hash').ensuredir()
>>> fpath = dpath / 'tmp2.txt'
>>> # We have the ability to only hash at most ``maxbytes`` in a file
>>> fpath.write_text('abcdefghijklmnop')
>>> h0 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=11, blocksize=3)
>>> h1 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=32, blocksize=3)
>>> h2 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=32, blocksize=32)
>>> h3 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=16, blocksize=1)
>>> h4 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=16, blocksize=18)
>>> assert h1 == h2 == h3 == h4
>>> assert h1 != h0
>>> # Using a stride makes the result dependent on the blocksize
>>> h0 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=11, blocksize=3, stride=2)
>>> h1 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=32, blocksize=3, stride=2)
>>> h2 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=32, blocksize=32, stride=2)
>>> h3 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=16, blocksize=1, stride=2)
>>> h4 = ub.hash_file(fpath, hasher='sha1', base='hex', maxbytes=16, blocksize=18, stride=2)
>>> assert h1 != h2 != h3
>>> assert h1 == h0
>>> assert h2 == h4
Example
>>> import ubelt as ub
>>> from os.path import join
>>> dpath = ub.ensure_app_cache_dir('ubelt/tests/test-hash')
>>> fpath = ub.touch(join(dpath, 'empty_file'))
>>> # Test that the output is the same as sha1sum executable
>>> if ub.find_exe('sha1sum'):
>>>     want = ub.cmd(['sha1sum', fpath], verbose=2)['out'].split(' ')[0]
>>>     got = ub.hash_file(fpath, hasher='sha1')
>>>     print('want = {!r}'.format(want))
>>>     print('got = {!r}'.format(got))
>>>     assert want.endswith(got)
>>> # Do the same for sha512 sum and md5sum
>>> if ub.find_exe('sha512sum'):
>>>     want = ub.cmd(['sha512sum', fpath], verbose=2)['out'].split(' ')[0]
>>>     got = ub.hash_file(fpath, hasher='sha512')
>>>     print('want = {!r}'.format(want))
>>>     print('got = {!r}'.format(got))
>>>     assert want.endswith(got)
>>> if ub.find_exe('md5sum'):
>>>     want = ub.cmd(['md5sum', fpath], verbose=2)['out'].split(' ')[0]
>>>     got = ub.hash_file(fpath, hasher='md5')
>>>     print('want = {!r}'.format(want))
>>>     print('got = {!r}'.format(got))
>>>     assert want.endswith(got)