ubelt.util_download module

Helpers for downloading data

The download() function accesses the network and requests the content at a specific url using urllib. You can either specify where the data goes or download it to the default location in the ubelt cache. Either way, this function returns the location of the downloaded data. You can also specify the expected hash in order to check the validity of the data. By default downloading is verbose.

The grabdata() function is almost identical to download(), but it checks whether the data already exists in the download location and only downloads if it needs to.
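
For instance, a hedged sketch of the difference between the two (reusing the example image URL from the doctests below; the repeated grabdata() call is expected to return the cached path without touching the network):

>>> # xdoctest: +REQUIRES(--network)
>>> import ubelt as ub
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath1 = ub.download(url)   # always transfers the data
>>> fpath2 = ub.grabdata(url)   # downloads only if not already cached
>>> fpath3 = ub.grabdata(url)   # second call reuses the cached file
>>> assert fpath2 == fpath3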

ubelt.util_download.download(url, fpath=None, dpath=None, fname=None, hash_prefix=None, hasher='sha512', chunksize=8192, verbose=1)[source]

Downloads a url to a file on disk.

If unspecified, the location and name of the file are chosen automatically. A hash_prefix can be specified to verify the integrity of the downloaded data. This function will download the data every time it is called. For cached downloading see grabdata.

Parameters
  • url (str) – The url to download.

  • fpath (PathLike | io.BytesIO) – The path to download to. Defaults to the url basename inside ubelt’s application cache. If this is an io.BytesIO object, then the data is written directly to that object (note that this prevents the use of temporary files).

  • dpath (PathLike) – where to download the file. If unspecified, ubelt’s application cache directory is used. Mutually exclusive with fpath. (A short sketch of how the destination arguments interact follows this parameter list.)

  • fname (str) – What to name the downloaded file. Defaults to the url basename. Mutually exclusive with fpath.

  • hash_prefix (None | str) – If specified, download will retry / error if the file hash does not match this value. Defaults to None.

  • hasher (str | Hasher) – If hash_prefix is specified, this indicates the hashing algorithm to apply to the file. Defaults to sha512.

  • chunksize (int, default=2 ** 13) – Download chunksize.

  • verbose (int, default=1) – Verbosity level 0 or 1.
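
As a brief sketch of how the destination arguments interact (the 'my_demo' appname and 'mario.png' filename are illustrative, and the exact cache directory is platform dependent):

>>> # xdoctest: +REQUIRES(--network)
>>> from os.path import join
>>> import ubelt as ub
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> # default: the url basename inside ubelt's application cache
>>> fpath = ub.download(url)
>>> # explicit directory and filename (mutually exclusive with fpath)
>>> dpath = ub.ensure_app_cache_dir('my_demo')
>>> fpath = ub.download(url, dpath=dpath, fname='mario.png')
>>> # or give the full path directly
>>> fpath = ub.download(url, fpath=join(dpath, 'mario.png'))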

Returns

fpath - path to the downloaded file.

Return type

PathLike

Raises
  • URLError - if there is a problem downloading the url

  • RuntimeError - if the hash does not match the hash_prefix

Notes

Based largely on code in pytorch [4], with modifications influenced by other resources [1] [2] [3].
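
The core pattern those resources describe is a chunked read combined with an incremental hash. A rough standalone sketch of that idea (the helper name _sketch_download is hypothetical and this is not ubelt's actual implementation):

import hashlib
from urllib.request import urlopen

def _sketch_download(url, fpath, hash_prefix=None, hasher_name='sha512', chunksize=8192):
    # stream the response in fixed-size chunks, hashing while writing to disk
    hasher = hashlib.new(hasher_name)
    with urlopen(url) as resp, open(fpath, 'wb') as file:
        while True:
            chunk = resp.read(chunksize)
            if not chunk:
                break
            file.write(chunk)
            hasher.update(chunk)
    if hash_prefix is not None and not hasher.hexdigest().startswith(hash_prefix):
        raise RuntimeError('downloaded hash does not match the given hash_prefix')
    return fpath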

References

[1] http://blog.moleculea.com/2012/10/04/urlretrieve-progres-indicator/

[2] http://stackoverflow.com/questions/15644964/python-progress-bar-and-downloads

[3] http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py

[4] https://github.com/pytorch/pytorch/blob/2787f1d8edbd4aadd4a8680d204341a1d7112e2d/torch/hub.py#L347

Todo

  • [ ] fine-grained control of progress

Example

>>> # xdoctest: +REQUIRES(--network)
>>> from ubelt.util_download import *  # NOQA
>>> from os.path import basename
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath = download(url)
>>> print(basename(fpath))
rqwaDag.png

Example

>>> # xdoctest: +REQUIRES(--network)
>>> import ubelt as ub
>>> import io
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> file = io.BytesIO()
>>> fpath = ub.download(url, file)
>>> file.seek(0)
>>> data = file.read()
>>> assert ub.hash_data(data, hasher='sha1').startswith('f79ea24571')

Example

>>> # xdoctest: +REQUIRES(--network)
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath = download(url, hasher='sha1', hash_prefix='f79ea24571da6ddd2ba12e3d57b515249ecb8a35')
Downloading url='http://i.imgur.com/rqwaDag.png' to fpath=...rqwaDag.png
...
...1233/1233... rate=... Hz, eta=..., total=...

Example

>>> # xdoctest: +REQUIRES(--network)
>>> # test download from girder
>>> import pytest
>>> import ubelt as ub
>>> url = 'https://data.kitware.com/api/v1/item/5b4039308d777f2e6225994c/download'
>>> ub.download(url, hasher='sha512', hash_prefix='c98a46cb31205cf')
>>> with pytest.raises(RuntimeError):
>>>     ub.download(url, hasher='sha512', hash_prefix='BAD_HASH')

ubelt.util_download.grabdata(url, fpath=None, dpath=None, fname=None, redo=False, verbose=1, appname=None, hash_prefix=None, hasher='sha512', **download_kw)[source]

Downloads a file, caches it, and returns its local path.

If unspecified, the location and name of the file are chosen automatically. A hash_prefix can be specified to verify the integrity of the downloaded data.

Parameters
  • url (str) – url to the file to download

  • fpath (PathLike) – The full path to download the file to. If unspecified, the arguments dpath and fname are used to determine this.

  • dpath (PathLike) – where to download the file. If unspecified appname is used to determine this. Mutually exclusive with fpath.

  • fname (str) – What to name the downloaded file. Defaults to the url basename. Mutually exclusive with fpath.

  • redo (bool, default=False) – if True, forces a redownload of the file

  • verbose (int, default=1) – verbosity flag

  • appname (str) – set dpath to ub.get_app_cache_dir(appname). Mutually exclusive with dpath and fpath.

  • hash_prefix (None | str) – If specified, grabdata verifies that this matches the hash of the file, and then saves the hash in an adjacent file to certify that the download was successful (a conceptual sketch of this check appears after the return type below). Defaults to None.

  • hasher (str | Hasher) – If hash_prefix is specified, this indicates the hashing algorithm to apply to the file. Defaults to sha512.

  • **download_kw – additional kwargs to pass to ub.download

Returns

fpath - path to downloaded or cached file.

Return type

PathLike
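
Conceptually, the caching works by writing a small stamp file next to the download; on later calls the stamp (and, when given, the hash_prefix) is checked before deciding whether to call download again. A hedged approximation of that logic (the helper name _sketch_grabdata is hypothetical and details differ from the real implementation):

import os
import ubelt as ub

def _sketch_grabdata(url, fpath, hash_prefix=None, hasher='sha512', redo=False):
    # sidecar stamp file certifying a previous successful, verified download
    stamp_fpath = fpath + '.' + hasher + '.hash'
    if not redo and os.path.exists(fpath):
        if hash_prefix is None:
            return fpath  # nothing to verify against; trust the existing file
        if os.path.exists(stamp_fpath):
            with open(stamp_fpath, 'r') as file:
                if file.read() == hash_prefix:
                    return fpath  # stamp agrees with the requested prefix
    # otherwise (re)download, verify, and refresh the stamp
    fpath = ub.download(url, fpath=fpath, hash_prefix=hash_prefix, hasher=hasher)
    if hash_prefix is not None:
        with open(stamp_fpath, 'w') as file:
            file.write(hash_prefix)
    return fpath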

CommandLine

xdoctest -m ubelt.util_download grabdata --network

Example

>>> # xdoctest: +REQUIRES(--network)
>>> import ubelt as ub
>>> from os.path import basename
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath = ub.grabdata(url, fname='mario.png')
>>> result = basename(fpath)
>>> print(result)
mario.png

Example

>>> # xdoctest: +REQUIRES(--network)
>>> import ubelt as ub
>>> fname = 'foo.bar'
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> prefix1 = '944389a39dfb8fa9'
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> stamp_fpath = fpath + '.sha512.hash'
>>> assert ub.readfrom(stamp_fpath) == prefix1
>>> # Check that the download doesn't happen again
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> # todo: check file timestamps have not changed
>>> #
>>> # Check redo works with hash
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1, redo=True)
>>> # todo: check file timestamps have changed
>>> #
>>> # Check that a redownload occurs when the stamp is changed
>>> open(stamp_fpath, 'w').write('corrupt-stamp')
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> assert ub.readfrom(stamp_fpath) == prefix1
>>> #
>>> # Check that a redownload occurs when the stamp is removed
>>> ub.delete(stamp_fpath)
>>> open(fpath, 'w').write('corrupt-data')
>>> assert not ub.hash_file(fpath, base='hex', hasher='sha512').startswith(prefix1)
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> assert ub.hash_file(fpath, base='hex', hasher='sha512').startswith(prefix1)
>>> #
>>> # Check that requesting new data causes redownload
>>> url2 = 'https://data.kitware.com/api/v1/item/5b4039308d777f2e6225994c/download'
>>> prefix2 = 'c98a46cb31205cf'
>>> fpath = ub.grabdata(url2, fname=fname, hash_prefix=prefix2)
>>> assert ub.readfrom(stamp_fpath) == prefix2