ubelt.util_hash module

Wrappers around hashlib functions to generate hash signatures for common data.

The hashes should be deterministic across platforms.

Note

The exact hashes generated for data objects and files may change in the future. When this happens, the HASH_VERSION attribute will be incremented.

ubelt.util_hash.hash_data(data, hasher=NoParam, hashlen=NoParam, base=NoParam)

Get a unique hash depending on the state of the data.

Parameters:
  • data (object) – any sort of loosely organized data
  • hasher (HASH) – hash algorithm from hashlib, defaults to sha512.
  • hashlen (int) – maximum number of symbols in the returned hash. If not specified, all are returned.
  • base (list) – list of symbols or shorthand key. Defaults to base 26
Returns:

text - hash string

Return type:

str

Example

>>> print(hash_data([1, 2, (3, '4')], hashlen=8, hasher='sha512'))
iugjngof
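
The parameters above can be illustrated with a standard-library sketch. This is not ubelt's implementation: the serialization scheme and the names `_serialize`, `to_base`, and `sketch_hash_data` are hypothetical, so the hashes it produces will not match those of `hash_data`. It only shows the general idea of turning nested data into bytes deterministically, hashing with hashlib, re-encoding the digest with a 26-symbol alphabet, and truncating to `hashlen` symbols.

```python
import hashlib

ALPHABET26 = 'abcdefghijklmnopqrstuvwxyz'  # assumed symbol list for "base 26"


def _serialize(data):
    """Turn nested data into bytes with type tags (hypothetical scheme)."""
    if isinstance(data, (list, tuple)):
        tag = b'LIST' if isinstance(data, list) else b'TUPLE'
        return tag + b'[' + b','.join(_serialize(item) for item in data) + b']'
    if isinstance(data, str):
        raw = data.encode('utf8')
        # Length prefix keeps strings containing ',' or ']' unambiguous.
        return b'STR' + str(len(raw)).encode('ascii') + b':' + raw
    if isinstance(data, int):
        return b'INT:' + str(data).encode('ascii')
    raise TypeError('unsupported type: {!r}'.format(type(data)))


def to_base(number, alphabet):
    """Write a non-negative integer using the given symbols as digits."""
    if number == 0:
        return alphabet[0]
    digits = []
    while number:
        number, rem = divmod(number, len(alphabet))
        digits.append(alphabet[rem])
    return ''.join(reversed(digits))


def sketch_hash_data(data, hasher='sha512', hashlen=None, alphabet=ALPHABET26):
    """Hash the serialized data and re-encode the digest with `alphabet`."""
    digest = hashlib.new(hasher, _serialize(data)).hexdigest()
    text = to_base(int(digest, 16), alphabet)
    # hashlen=None keeps every symbol; otherwise keep at most hashlen symbols.
    return text[:hashlen]


short_hash = sketch_hash_data([1, 2, (3, '4')], hashlen=8)
print(short_hash)
```

Tagging each value with its type means that `[3, '4']` and `(3, '4')` serialize differently, so data that merely looks similar does not share a hash.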

ubelt.util_hash.hash_file(fpath, blocksize=65536, stride=1, hasher=NoParam, hashlen=NoParam, base=NoParam)

Hashes the data in a file on disk.

Parameters:
  • fpath (str) – file path string
  • blocksize (int) – number of bytes read at a time. Defaults to 2 ** 16. Affects the speed of reading the file.
  • stride (int) – strides > 1 skip data when hashing, which is faster but less accurate, and also makes the hash dependent on blocksize.
  • hasher (HASH) – hash algorithm from hashlib, defaults to sha512.
  • hashlen (int) – maximum number of symbols in the returned hash. If not specified, all are returned.
  • base (list) – list of symbols or shorthand key. Defaults to base 26

Notes

For better hashes, keep stride = 1. For faster hashes, set stride > 1. The blocksize matters when stride > 1.
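
A standard-library sketch may help show why blocksize matters only when stride > 1. It is not ubelt's implementation: `sketch_hash_file` is a hypothetical helper that assumes a strided hash reads one block and then seeks past the next `stride - 1` blocks, which may differ from what `hash_file` actually does.

```python
import hashlib
import os
import tempfile


def sketch_hash_file(fpath, blocksize=65536, stride=1, hasher='sha512'):
    """Hash a file block by block, reading only every stride-th block."""
    hashobj = hashlib.new(hasher)
    with open(fpath, 'rb') as file:
        while True:
            block = file.read(blocksize)
            if not block:
                break
            hashobj.update(block)
            if stride > 1:
                # Skip the next (stride - 1) blocks without reading them.
                file.seek(blocksize * (stride - 1), os.SEEK_CUR)
    return hashobj.hexdigest()


data = b'aaaabbbbccccdddd'
with tempfile.TemporaryDirectory() as dpath:
    fpath = os.path.join(dpath, 'tmp.bin')
    with open(fpath, 'wb') as file:
        file.write(data)
    # With stride=1 every byte is hashed, so the blocksize is irrelevant.
    full_small = sketch_hash_file(fpath, blocksize=4, stride=1)
    full_large = sketch_hash_file(fpath, blocksize=8, stride=1)
    # With stride=2 the blocksize decides which bytes are skipped.
    strided_small = sketch_hash_file(fpath, blocksize=4, stride=2)
    strided_large = sketch_hash_file(fpath, blocksize=8, stride=2)

print(full_small == full_large)        # same bytes hashed in both cases
print(strided_small == strided_large)  # different bytes hashed
```

In this sketch a blocksize of 4 with stride 2 hashes only `aaaa` and `cccc`, while a blocksize of 8 hashes only `aaaabbbb`. A strided hash is therefore only comparable with other hashes computed using the same blocksize and stride.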

References

http://stackoverflow.com/questions/3431825/md5-checksum-of-a-file

http://stackoverflow.com/questions/5001893/when-to-use-sha-1-vs-sha-2

Example

>>> import ubelt as ub
>>> from os.path import join
>>> fpath = join(ub.ensure_app_cache_dir('ubelt'), 'tmp.txt')
>>> ub.writeto(fpath, 'foobar')
>>> print(ub.hash_file(fpath, hasher='sha512', hashlen=8))
vkiodmcj