ubelt.util_hash module

Wrappers around hashlib functions to generate hash signatures for common data.

The hashes are deterministic across Python versions and operating systems. This is verified by CI testing on Windows and Linux, on Python 2.7, 3.4, and greater, and on both 32-bit and 64-bit builds.

Use Case:

Problem: You have data that you want to hash.

Assumptions: The data consists of standard Python scalars or ordered sequences, e.g. tuple, list, odict, oset, int, str, etc.

Solution: ub.hash_data

Example

>>> import ubelt as ub
>>> data = ub.odict(sorted({
>>>     'param1': True,
>>>     'param2': 0,
>>>     'param3': [None],
>>>     'param4': ('str', 4.2),
>>> }.items()))
>>> # hash_data can hash any ordered builtin object
>>> ub.hash_data(data, convert=False, hasher='sha512')
2ff39d0ecbf6ecc740ca7d...

Use Case:

Problem: You have a file you want to hash, but your system doesn’t have a sha1sum executable (or you don’t want to use Popen).

Solution: ub.hash_file

Example

>>> import ubelt as ub
>>> from os.path import join
>>> fpath = ub.touch(join(ub.ensure_app_cache_dir('ubelt'), 'empty_file'))
>>> ub.hash_file(fpath, hasher='sha1')
da39a3ee5e6b4b0d3255bfef95601890afd80709

Note

The exact hashes generated for data objects and files may change in the future. When this happens, the HASH_VERSION attribute will be incremented.

ubelt.util_hash.hash_data(data, hasher=NoParam, base=NoParam, types=False, hashlen=NoParam, convert=False)

Get a unique hash depending on the state of the data.

Parameters:
  • data (object) – Any sort of loosely organized data
  • hasher (str or HASHER) – Hash algorithm from hashlib, defaults to sha512.
  • base (str or List[str]) – Shorthand key or a list of symbols. Valid keys are: ‘abc’, ‘hex’, and ‘dec’. Defaults to ‘hex’.
  • types (bool) – If True data types are included in the hash, otherwise only the raw data is hashed. Defaults to False.
  • hashlen (int) – Maximum number of symbols in the returned hash. If not specified, all are returned. DEPRECATED. Use slice syntax instead.
  • convert (bool, optional, default=False) – if True, try to convert the data to JSON and hash the JSON instead. This can improve runtime in some instances; however, the hash may differ from the case where convert=False.
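
To illustrate what the types flag is meant to control, here is a minimal sketch built only on hashlib. The helper name sketch_hash_data is hypothetical, it handles only lists, tuples, and scalars, and it is not ubelt's implementation, so it will not reproduce the hashes ub.hash_data returns.

```python
import hashlib


def sketch_hash_data(data, types=False, hasher='sha512'):
    """Hypothetical helper: hash nested ordered data using hashlib only."""
    h = hashlib.new(hasher)

    def update(obj):
        if types:
            # Mix the type name into the hash so that, for example,
            # the int 1 and the str '1' no longer collide.
            h.update(type(obj).__name__.encode('utf8') + b':')
        if isinstance(obj, (list, tuple)):
            # Walk ordered sequences item by item, with separators.
            h.update(b'[')
            for item in obj:
                update(item)
                h.update(b',')
            h.update(b']')
        else:
            # Scalars are hashed through their string form (an assumption
            # made for this sketch).
            h.update(str(obj).encode('utf8'))

    update(data)
    return h.hexdigest()


# Without type information, only the raw values contribute to the hash.
same = sketch_hash_data([1, '2']) == sketch_hash_data(('1', 2))
# With type information, the list/tuple and int/str differences matter.
differ = (sketch_hash_data([1, '2'], types=True)
          != sketch_hash_data(('1', 2), types=True))
print(same, differ)
```

In this sketch the two inputs collide when types=False and are distinguished when types=True, which is the trade-off the parameter description refers to.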

Notes

alphabet26 is a pretty nice base, I recommend it. However, we default to hex because it is standard. This means the output of hash_data with hasher='sha1' uses the same hexadecimal encoding as the output of sha1sum.
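
The idea of a base can be sketched as re-encoding the digest, read as one large integer, with a different list of symbols. The helper encode_digest and the two alphabets below are assumptions made for illustration; ubelt's own conversion may differ in details such as padding, so the output is not guaranteed to match ub.hash_data.

```python
import hashlib
import string

HEX_ALPHABET = '0123456789abcdef'
ABC_ALPHABET = string.ascii_lowercase  # 26 symbols


def encode_digest(digest_bytes, alphabet):
    """Hypothetical helper: write a digest in an arbitrary symbol alphabet."""
    num = int.from_bytes(digest_bytes, 'big')
    if num == 0:
        return alphabet[0]
    base = len(alphabet)
    symbols = []
    while num > 0:
        # Peel off the least significant "digit" in the target base.
        num, rem = divmod(num, base)
        symbols.append(alphabet[rem])
    # Digits were produced least significant first, so reverse them.
    return ''.join(reversed(symbols))


digest = hashlib.sha1(b'foobar').digest()
as_hex = encode_digest(digest, HEX_ALPHABET)
as_abc = encode_digest(digest, ABC_ALPHABET)
print(as_hex)
print(as_abc)
```

A larger alphabet packs the same digest into fewer symbols, which is one reason a 26-letter base can be attractive for short identifiers.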

Returns: text - hash string
Return type: str

Example

>>> import ubelt as ub
>>> print(ub.hash_data([1, 2, (3, '4')], convert=False))
60b758587f599663931057e6ebdf185a...
>>> print(ub.hash_data([1, 2, (3, '4')], base='abc', hasher='sha512')[:32])
hsrgqvfiuxvvhcdnypivhhthmrolkzej

ubelt.util_hash.hash_file(fpath, blocksize=65536, stride=1, hasher=NoParam, hashlen=NoParam, base=NoParam)

Hashes the data in a file on disk.

Parameters:
  • fpath (PathLike) – file path string
  • blocksize (int) – number of bytes read at a time; defaults to 2 ** 16 (65536). Affects the speed of reading the file.
  • stride (int) – strides > 1 skip blocks of data when hashing, which is faster but less accurate, and also makes the hash dependent on blocksize.
  • hasher (HASH) – hash algorithm from hashlib, defaults to sha512.
  • hashlen (int) – maximum number of symbols in the returned hash. If not specified, all are returned.
  • base (list, str) – list of symbols or shorthand key. Valid keys are ‘abc’, ‘hex’, and ‘dec’. Defaults to ‘hex’.

Notes

For better hashes, keep stride = 1.
For faster hashes, set stride > 1.
blocksize matters when stride > 1.
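
The interaction between stride and blocksize can be sketched with hashlib alone. The helper sketch_hash_file below is hypothetical and only assumes that a stride of n hashes one block and then skips the next n - 1 blocks; it is not taken from ubelt's source.

```python
import hashlib
import os
import tempfile


def sketch_hash_file(fpath, blocksize=65536, stride=1, hasher='sha1'):
    """Hypothetical helper: hash a file block by block, optionally skipping."""
    h = hashlib.new(hasher)
    with open(fpath, 'rb') as file:
        buf = file.read(blocksize)
        while buf:
            h.update(buf)
            if stride > 1:
                # Skip the next (stride - 1) blocks without hashing them.
                file.seek(blocksize * (stride - 1), os.SEEK_CUR)
            buf = file.read(blocksize)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as dpath:
    fpath = os.path.join(dpath, 'demo.bin')
    # 16 KiB of non-repeating data so that skipped regions really differ.
    data = b''.join(i.to_bytes(4, 'big') for i in range(4096))
    with open(fpath, 'wb') as file:
        file.write(data)
    full_a = sketch_hash_file(fpath, blocksize=1024, stride=1)
    full_b = sketch_hash_file(fpath, blocksize=4096, stride=1)
    skip_a = sketch_hash_file(fpath, blocksize=1024, stride=2)
    skip_b = sketch_hash_file(fpath, blocksize=4096, stride=2)

print(full_a == full_b)  # every byte is hashed, so blocksize is irrelevant
print(skip_a == skip_b)  # different bytes are skipped, so the hashes differ
```

With stride = 1 the result matches hashing the whole file at once, whatever the blocksize. With stride > 1 the blocksize decides which bytes are skipped, so the same file can produce different hashes.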

References

http://stackoverflow.com/questions/3431825/md5-checksum-of-a-file
http://stackoverflow.com/questions/5001893/when-to-use-sha-1-vs-sha-2

Example

>>> import ubelt as ub
>>> from os.path import join
>>> fpath = join(ub.ensure_app_cache_dir('ubelt'), 'tmp.txt')
>>> ub.writeto(fpath, 'foobar')
>>> print(ub.hash_file(fpath, hasher='sha1', base='hex'))
8843d7f92416211de9ebb963ff4ce28125932878

Example

>>> import ubelt as ub
>>> from os.path import join
>>> fpath = ub.touch(join(ub.ensure_app_cache_dir('ubelt'), 'empty_file'))
>>> # Test that the output is the same as sha1sum
>>> if ub.find_exe('sha1sum'):
>>>     want = ub.cmd(['sha1sum', fpath], verbose=2)['out'].split(' ')[0]
>>>     got = ub.hash_file(fpath, hasher='sha1')
>>>     print('want = {!r}'.format(want))
>>>     print('got = {!r}'.format(got))
>>>     assert want.endswith(got)
>>> # Do the same for sha512sum and md5sum
>>> if ub.find_exe('sha512sum'):
>>>     want = ub.cmd(['sha512sum', fpath], verbose=2)['out'].split(' ')[0]
>>>     got = ub.hash_file(fpath, hasher='sha512')
>>>     print('want = {!r}'.format(want))
>>>     print('got = {!r}'.format(got))
>>>     assert want.endswith(got)
>>> if ub.find_exe('md5sum'):
>>>     want = ub.cmd(['md5sum', fpath], verbose=2)['out'].split(' ')[0]
>>>     got = ub.hash_file(fpath, hasher='md5')
>>>     print('want = {!r}'.format(want))
>>>     print('got = {!r}'.format(got))
>>>     assert want.endswith(got)