ubelt.util_download module
Helpers for downloading data.
The download() function accesses the network and requests the content at a
specific url using urllib. You can either specify where the data goes or
download it to the default location in the ubelt cache. Either way, this
function returns the location of the downloaded data. You can also specify the
expected hash in order to check the validity of the data. By default,
downloading is verbose.
The grabdata() function is almost identical to download(), but it checks
whether the data already exists in the download location, and only downloads
if it needs to.
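The caching behavior that grabdata() layers on top of download() can be sketched roughly as follows. This is a simplified illustration of the idea, not ubelt's actual implementation; cached_fetch and fake_fetch are hypothetical names, and fake_fetch stands in for the real network download.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(fetch, fpath, hash_prefix=None, hasher='sha512'):
    # Simplified sketch of a grabdata-style cache check: reuse the file
    # on disk when it exists (and matches the expected hash prefix),
    # otherwise call ``fetch`` to produce it again.
    fpath = Path(fpath)
    if fpath.exists():
        if hash_prefix is None:
            return fpath  # cached copy, no verification requested
        digest = hashlib.new(hasher, fpath.read_bytes()).hexdigest()
        if digest.startswith(hash_prefix):
            return fpath  # cached copy passed the hash check
    fetch(fpath)  # cache miss (or hash mismatch): fetch again
    if hash_prefix is not None:
        digest = hashlib.new(hasher, fpath.read_bytes()).hexdigest()
        if not digest.startswith(hash_prefix):
            raise RuntimeError('fetched data failed the hash check')
    return fpath

# Hypothetical stand-in for download(): writes known bytes, counts calls.
calls = []
def fake_fetch(fpath):
    calls.append(1)
    Path(fpath).write_bytes(b'hello')

path = Path(tempfile.mkdtemp()) / 'demo.bin'
prefix = hashlib.sha512(b'hello').hexdigest()[:16]
cached_fetch(fake_fetch, path, hash_prefix=prefix)  # first call fetches
cached_fetch(fake_fetch, path, hash_prefix=prefix)  # second call hits cache
print(len(calls))  # → 1
```

The real grabdata() additionally records the verified hash in an adjacent stamp file, so later calls can skip rehashing large files.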
- ubelt.util_download.download(url, fpath=None, dpath=None, fname=None, appname=None, hash_prefix=None, hasher='sha512', chunksize=8192, filesize=None, verbose=1, timeout=NoParam, progkw=None, requestkw=None)
Downloads a url to a file on disk and returns the path.
If unspecified, the location and name of the file are chosen automatically. A hash_prefix can be specified to verify the integrity of the downloaded data. This function will download the data every time it is called. For cached downloading see grabdata().
- Parameters:
url (str) – The url to download.
fpath (str | PathLike | io.BytesIO | None) – The path to download to. Defaults to the basename of the url inside ubelt's application cache. If this is an io.BytesIO object, then the information is written directly to that object (note that this prevents the use of temporary files).
dpath (str | PathLike | None) – Where to download the file. If unspecified, appname is used to determine this. Mutually exclusive with fpath.
fname (str | None) – What to name the downloaded file. Defaults to the url basename. Mutually exclusive with fpath.
appname (str | None) – Sets dpath to ub.Path.appdir(appname or 'ubelt', type='cache') if dpath and fpath are not given.
hash_prefix (None | str) – If specified, download will retry / error if the file hash does not match this value. Defaults to None.
hasher (str | Hasher) – If hash_prefix is specified, this indicates the hashing algorithm to apply to the file. Defaults to sha512.
chunksize (int) – Download chunksize in bytes. Defaults to 2 ** 13 (8192).
filesize (int | None) – If known, the filesize in bytes. If unspecified, attempts to read this from the content headers.
verbose (int | bool) – Verbosity flag. Quiet is 0; higher is more verbose. Defaults to 1.
timeout (float | NoParamType) – Timeout in seconds for urllib.request.urlopen(). If not specified, the global default timeout setting is used. This only works for HTTP, HTTPS, and FTP connections, for blocking operations like the connection attempt.
progkw (Dict | NoParamType | None) – If specified, provides extra arguments to the progress iterator. See ubelt.progiter.ProgIter for available options.
requestkw (Dict | NoParamType | None) – If specified, provides extra arguments to urllib.request.Request, which can be used to customize headers and other low-level information sent to the target server. The common use case is to specify headers: Dict[str, str] in order to "spoof" the user agent, e.g. headers={'User-Agent': 'Mozilla/5.0'}. (New in ubelt 1.3.7.)
- Returns:
fpath - path to the downloaded file.
- Return type:
str | PathLike
- Raises:
URLError – if there is a problem downloading the url.
RuntimeError – if the hash does not match the hash_prefix.
Note
Based largely on code in pytorch [TorchDL] with modifications influenced by other resources [Shichao_2012] [SO_15644964] [SO_16694907].
References
[Shichao_2012] https://blog.shichao.io/2012/10/04/progress_speed_indicator_for_urlretrieve_in_python.html
Example
>>> # xdoctest: +REQUIRES(--network)
>>> # The default usage is to simply download an image to the default
>>> # download folder and return the path to the file.
>>> import ubelt as ub
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath = download(url)
>>> print(ub.Path(fpath).name)
rqwaDag.png
Example
>>> # xdoctest: +REQUIRES(--network)
>>> # To ensure you get the file you are expecting, it is a good idea
>>> # to specify a hash that will be checked.
>>> import ubelt as ub
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath = ub.download(url, hasher='sha1', hash_prefix='f79ea24571da6ddd2ba12e3d57b515249ecb8a35')
>>> print(ub.Path(fpath).name)
Downloading url='http://i.imgur.com/rqwaDag.png' to fpath=...rqwaDag.png
...
...1233/1233... rate=... Hz, eta=..., total=...
rqwaDag.png
Example
>>> # xdoctest: +REQUIRES(--network)
>>> # You can save directly to bytes in memory using a BytesIO object.
>>> import ubelt as ub
>>> import io
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> file = io.BytesIO()
>>> fpath = ub.download(url, file)
>>> file.seek(0)
>>> data = file.read()
>>> assert ub.hash_data(data, hasher='sha1').startswith('f79ea24571')
Example
>>> # xdoctest: +REQUIRES(--network)
>>> # Bad hashes will raise a RuntimeError, which could indicate
>>> # corrupted data or a security issue.
>>> import pytest
>>> import ubelt as ub
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> with pytest.raises(RuntimeError):
>>>     ub.download(url, hasher='sha512', hash_prefix='BAD_HASH')
- ubelt.util_download.grabdata(url, fpath=None, dpath=None, fname=None, redo=False, verbose=1, appname=None, hash_prefix=None, hasher='sha512', expires=None, **download_kw)
Downloads a file, caches it, and returns its local path.
If unspecified, the location and name of the file are chosen automatically. A hash_prefix can be specified to verify the integrity of the downloaded data.
- Parameters:
url (str) – url of the file to download
fpath (Optional[str | PathLike]) – The full path to download the file to. If unspecified, the arguments dpath and fname are used to determine this.
dpath (Optional[str | PathLike]) – where to download the file. If unspecified appname is used to determine this. Mutually exclusive with fpath.
fname (Optional[str]) – What to name the downloaded file. Defaults to the url basename. Mutually exclusive with fpath.
redo (bool) – If True, forces a redownload of the file. Defaults to False.
verbose (int) – Verbosity flag. Quiet is 0, higher is more verbose. Defaults to 1.
appname (str | None) – Sets dpath to ub.get_app_cache_dir(appname or 'ubelt') if dpath and fpath are not given.
hash_prefix (None | str) – If specified, grabdata verifies that this matches the hash of the file, and then saves the hash in an adjacent file to certify that the download was successful. Defaults to None.
hasher (str | Hasher) – If hash_prefix is specified, this indicates the hashing algorithm to apply to the file. Defaults to sha512. NOTE: Only pass hasher as a string. Passing it as an instance is deprecated and can cause unexpected results.
expires (str | int | datetime.datetime | None) – When the cache should expire and be redownloaded, or the number of seconds to wait before the cached file is considered expired.
**download_kw – Additional kwargs to pass to ubelt.util_download.download(). This includes chunksize, filesize, timeout, progkw, and requestkw.
- Returns:
fpath - path to downloaded or cached file.
- Return type:
str | PathLike
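When expires is given as a number of seconds, the cache validity test reduces to comparing the stamp file's age against that budget. A minimal sketch of that check under assumed semantics (not ubelt's actual code, which also accepts str and datetime values):

```python
import time
import tempfile
from pathlib import Path

def stamp_is_expired(stamp_fpath, expires_seconds):
    # A stamp older than the budget, or missing entirely, means the
    # cached file should be redownloaded.
    stamp_fpath = Path(stamp_fpath)
    if not stamp_fpath.exists():
        return True
    age = time.time() - stamp_fpath.stat().st_mtime
    return age > expires_seconds

stamp = Path(tempfile.mkdtemp()) / 'foo.bar.stamp_sha512.json'
print(stamp_is_expired(stamp, 3600))  # → True (no stamp yet)
stamp.write_text('{}')
print(stamp_is_expired(stamp, 3600))  # → False (freshly written)
```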
CommandLine
xdoctest -m ubelt.util_download grabdata --network
Example
>>> # xdoctest: +REQUIRES(--network)
>>> from os.path import basename
>>> import ubelt as ub
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> fpath = ub.grabdata(url, fname='mario.png')
>>> result = basename(fpath)
>>> print(result)
mario.png
Example
>>> # xdoctest: +REQUIRES(--network)
>>> import ubelt as ub
>>> import json
>>> fname = 'foo.bar'
>>> url = 'http://i.imgur.com/rqwaDag.png'
>>> prefix1 = '944389a39dfb8fa9'
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1, verbose=3)
>>> stamp_fpath = ub.Path(fpath + '.stamp_sha512.json')
>>> assert json.loads(stamp_fpath.read_text())['hash'][0].startswith(prefix1)
>>> # Check that the download doesn't happen again
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> # todo: check file timestamps have not changed
>>> #
>>> # Check redo works with hash
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1, redo=True)
>>> # todo: check file timestamps have changed
>>> #
>>> # Check that a redownload occurs when the stamp is changed
>>> with open(stamp_fpath, 'w') as file:
>>>     file.write('corrupt-stamp')
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> assert json.loads(stamp_fpath.read_text())['hash'][0].startswith(prefix1)
>>> #
>>> # Check that a redownload occurs when the stamp is removed
>>> ub.delete(stamp_fpath)
>>> with open(fpath, 'w') as file:
>>>     file.write('corrupt-data')
>>> assert not ub.hash_file(fpath, base='hex', hasher='sha512').startswith(prefix1)
>>> fpath = ub.grabdata(url, fname=fname, hash_prefix=prefix1)
>>> assert ub.hash_file(fpath, base='hex', hasher='sha512').startswith(prefix1)
>>> #
>>> # Check that requesting new data causes redownload
>>> #url2 = 'https://data.kitware.com/api/v1/item/5b4039308d777f2e6225994c/download'
>>> #prefix2 = 'c98a46cb31205cf'  # hack SSL
>>> url2 = 'http://i.imgur.com/rqwaDag.png'
>>> prefix2 = '944389a39dfb8fa9'
>>> fpath = ub.grabdata(url2, fname=fname, hash_prefix=prefix2)
>>> assert json.loads(stamp_fpath.read_text())['hash'][0].startswith(prefix2)