How to download a file from a website link using a Python script or code snippet

In this article, I’m going to demonstrate some code snippets that you can use to download files from the Internet using Python. Before we begin, you might be wondering why you should go through the hassle of writing a script to download files when you can simply open the link in a browser and click download.

Why?

To ease repetitive tasks. For example, let’s say you are browsing a website with tons of download links and you want to download all of these files. Doing this manually will consume a lot of your time, and you will get bored or frustrated repeating the same clicks over and over. This is a good case for automating the work with a script.

Intro

I’m assuming you have a solid basic knowledge of Python. I’m going to use some Python libraries that are available on the Python Package Index (PyPI) and can be installed with pip. The code in this tutorial is written in Python 3 and was tested on a Linux machine. Since it is written in Python, it should work on other operating systems as well.

Download Small/Large Files Using Requests

If you are downloading tiny files, you can simply use Python’s most popular HTTP library, requests. Install it using pip if it is not installed already. The code will be similar to this:

	import requests

	filename = "logo.png"
	document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"

	# fetch the whole response into memory, then write it out in one go
	with open(filename, 'wb') as f:
	    f.write(requests.get(document_url).content)

downloads a tiny file using requests module of python 3

In the above script, we are downloading the logo of my website and saving it in a file named logo.png. This code should work for tiny files. However, it loads the whole response into memory, so downloading a massive file this way can eat up all your RAM. Instead, you can stream the file and consume the content in chunks, like below:

	import requests

	chunk_size = 4096
	filename = "logo.png"
	document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"

	# stream=True defers the body download until we iterate over it
	with requests.get(document_url, stream=True) as r:
	    with open(filename, 'wb') as f:
	        for chunk in r.iter_content(chunk_size):
	            if chunk:
	                f.write(chunk)

download massive file using requests
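
The streaming loop above writes straight to the final filename, so an interrupted download leaves a truncated file behind. Here is a minimal sketch of one way around that: write the chunks to a temporary file and rename it only once everything has arrived. The helper name save_chunks and the .part suffix are my own choices for illustration; they are not part of requests.

```python
import os


def save_chunks(chunks, destination):
    """Write an iterable of byte chunks to `destination` via a temporary file.

    Returns the number of bytes written. If anything goes wrong while
    reading or writing, the temporary file is removed and the error is
    re-raised, so a half-finished download never ends up at `destination`.
    """
    temp_path = destination + ".part"  # hypothetical naming convention
    written = 0
    try:
        with open(temp_path, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty chunks
                    f.write(chunk)
                    written += len(chunk)
        # Rename only after every chunk has been written successfully.
        os.replace(temp_path, destination)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
    return written
```

With requests, you would call it inside the streaming block, for example save_chunks(r.iter_content(chunk_size), filename).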

Some files require additional headers or cookies. With requests, you can easily set both. The easiest way to find out which headers and cookies to use is to inspect the network request using a web browser’s developer tools. For example, in Chrome, pressing Ctrl + Shift + I opens the developer tools. In the Network tab you can inspect the request headers. You can also right-click the download request and select “Copy as cURL” to copy the request, including its headers, as is.

Let’s take a look at how we can set headers and cookies using requests.

	import requests

	chunk_size = 4096
	filename = "logo.png"
	document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
	headers = {
	    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
	    "Connection": "keep-alive",
	}

	# a Session keeps cookies across requests
	s = requests.Session()
	cookie = requests.cookies.create_cookie('COOKIE_NAME', 'COOKIE_VALUE')
	s.cookies.set_cookie(cookie)

	with s.get(document_url, stream=True, headers=headers) as r:
	    with open(filename, 'wb') as f:
	        for chunk in r.iter_content(chunk_size):
	            if chunk:
	                f.write(chunk)

downloading file using requests with custom cookies & headers
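
If you used the “Copy as cURL” option mentioned earlier, you can pull the headers out of the copied command instead of retyping them. Below is a rough sketch that assumes the bash flavour of the copied command, where each header appears as -H 'Name: value'. The function name headers_from_curl is made up for this example, and more exotic quoting (such as $'...' strings) is not handled.

```python
import shlex


def headers_from_curl(command):
    """Collect the -H/--header values of a copied cURL command into a dict."""
    # Copied commands usually span several lines joined with a trailing
    # backslash; turn those continuations into plain spaces first.
    tokens = shlex.split(command.replace("\\\n", " "))
    headers = {}
    # Walk over (flag, value) pairs of neighbouring tokens.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers
```

The resulting dictionary can be passed as the headers argument of s.get() in the script above.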

Combine wget or cURL with Python to download files

In some cases, downloading files can be quite troublesome. For example, if the server is not properly configured or serves files badly, you may get network errors or exceptions such as ChunkedEncodingError or IncompleteRead. This can happen for multiple reasons, such as the server not sending a Content-Length header or closing the connection abruptly. To cope with this, you can get help from robust command-line downloaders such as curl or wget. Interestingly, there is a Python package on PyPI called wget. Note that it is a pure-Python downloader named after the command-line tool rather than a wrapper around the wget binary on your system. The code is as simple as this:

	import wget

	document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
	wget.download(document_url)

Of course, you will have to install the package first, i.e. pip install wget.

Alternatively, you can use pretty much any other command-line downloader that is available on your system, such as curl, aria2c, or axel. For example, see the snippet below:

	from subprocess import call

	document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
	filename = "logo.png"
	call([
	    "curl", document_url,
	    '-H', 'Connection: keep-alive',
	    '-H', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
	    '--compressed',
	    "--output", filename,
	])

I’m using Python’s subprocess module to invoke an external program called curl to download the file. I’ve set two custom headers using the -H argument. The output filename/path can be specified using the --output argument.
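
When you have many links to fetch, typing the argument list by hand gets tedious. Here is a small sketch that builds the curl argument list from a URL and a headers dictionary, and guesses the output filename from the URL when none is given. The names filename_from_url and curl_command, and the fallback name download.bin, are hypothetical. I have also added curl’s --location and --fail flags, which the snippet above does not use, so that redirects are followed and HTTP errors are reported as failures.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Guess a local filename from the last path segment of a URL."""
    name = posixpath.basename(unquote(urlsplit(url).path))
    # Fall back when the URL has no usable last segment.
    if name in ("", ".", ".."):
        return default
    return name


def curl_command(url, headers=None, filename=None):
    """Build an argument list for curl, suitable for the subprocess module."""
    command = ["curl", "--location", "--fail", "--compressed"]
    for name, value in (headers or {}).items():
        command += ["-H", f"{name}: {value}"]
    command += ["--output", filename or filename_from_url(url), url]
    return command
```

You could then run it with subprocess.run(curl_command(document_url, headers), check=True), which raises an error if curl exits with a failure code.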

Download files that require authentication using Python Selenium & requests

Files that require authentication and dynamic interaction from the user, such as clicking a button to submit a complex form, can be very tricky to download with the tools mentioned above. However, we can combine Selenium and requests to achieve it. The trick is to authenticate and do all the interaction using Selenium with a webdriver, say chromedriver, then copy the session cookies from the driver into a requests session, and finally download the file. Sounds complicated? Look at the code below:

	from selenium import webdriver
	from selenium.webdriver.common.by import By
	import requests

	username = "Your Username"
	password = "Your Password"

	driver = webdriver.Chrome()

	# authenticate using username, password
	login_url = "https://your.target_website.com/login/"
	driver.get(login_url)
	driver.find_element(By.ID, "username").send_keys(username)
	driver.find_element(By.ID, "password").send_keys(password + "\n")

	# interact with target web elements to submit a form
	# (target_url is a placeholder for the page that holds the form)
	target_url = "https://your.target_website.com/target/"
	driver.get(target_url)
	driver.find_element(By.ID, "some_button").click()

	# retrieve download url
	download_url = (
	    driver.find_element(By.ID, "div_with_download_link")
	    .find_element(By.TAG_NAME, "a")
	    .get_attribute("href")
	)

	chunk_size = 4096
	download_filename = "logo.png"

	# your custom headers
	headers = {
	    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
	    "Connection": "keep-alive",
	}

	s = requests.Session()

	# retrieve and set cookies from selenium to requests session
	for selenium_cookie in driver.get_cookies():
	    cookie = requests.cookies.create_cookie(selenium_cookie['name'], selenium_cookie['value'])
	    s.cookies.set_cookie(cookie)

	with s.get(download_url, stream=True, headers=headers) as r:
	    with open(download_filename, 'wb') as f:
	        for chunk in r.iter_content(chunk_size):
	            if chunk:
	                f.write(chunk)

	driver.quit()

selenium + chromedriver + requests combo for downloading file
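
The cookie hand-off is the heart of this trick, so here it is in isolation. driver.get_cookies() returns a list of dictionaries with keys such as name, value and domain. The sketch below flattens that list into a plain name-to-value mapping, optionally keeping only the cookies of one site. The function name cookies_for_requests and the domain filter are my own additions for illustration.

```python
def cookies_for_requests(selenium_cookies, domain=None):
    """Turn Selenium's list of cookie dicts into a {name: value} mapping.

    If `domain` is given, keep only cookies set for that domain or one
    of its subdomains.
    """
    result = {}
    for cookie in selenium_cookies:
        if domain is not None:
            # Selenium may report domains with a leading dot (".example.com").
            host = cookie.get("domain", "").lstrip(".")
            if host != domain and not host.endswith("." + domain):
                continue
        result[cookie["name"]] = cookie["value"]
    return result
```

The mapping can be handed to requests through the cookies argument, for example s.get(download_url, cookies=cookies_for_requests(driver.get_cookies())).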

If you still don’t understand, just leave a comment and I will try to help you understand how it works.

BONUS Trick: Downloading a PDF file using a JavaScript, Selenium, & Python combination

The snippet below downloads a PDF file using the browser’s Print option. In this script, we automate a Chrome browser with chromedriver and save the page as a PDF by executing the window.print() JavaScript command. You will first have to install selenium & pyvirtualdisplay (on Linux, pyvirtualdisplay also needs the Xvfb system package).

pip install selenium pyvirtualdisplay

I’ve added some comments in the code to clarify the process in detail.

	import json
	import time
	from pyvirtualdisplay import Display
	from selenium import webdriver

	document_url = "https://www.adobe.com/content/dam/acom/en/accessibility/products/acrobat/pdfs/acrobat-x-accessibility-checker.pdf"
	download_dir = "/path/to/dir/"

	# setup a virtual display using pyvirtualdisplay
	display = Display(visible=0, size=(1768, 1368))
	display.start()

	chrome_options = webdriver.ChromeOptions()
	chrome_options.add_argument("--no-sandbox")
	chrome_options.add_argument("--disable-notifications")
	chrome_options.add_argument("--disable-popup-blocking")
	chrome_options.add_argument("--disable-logging")
	chrome_options.add_argument("--log-level=3")

	# chrome option settings to enable automatic download in the specified folder without showing Save As PDF dialog box.
	chrome_options.add_argument("--kiosk-printing")
	appState = {
	    "recentDestinations": [{"id": "Save as PDF", "origin": "local"}],
	    "selectedDestinationId": "Save as PDF",
	    "version": 2,
	}

	prefs = {
	    "printing.print_preview_sticky_settings.appState": json.dumps(appState),
	    "download": {
	        "default_directory": download_dir,
	        "prompt_for_download": False,
	        "directory_upgrade": True,
	    },
	}
	chrome_options.add_experimental_option("prefs", prefs)
	driver = webdriver.Chrome(options=chrome_options)
	driver.get(document_url)
	# give the page time to load before printing
	time.sleep(10)
	driver.execute_script("window.print();")
	# give Chrome time to write the PDF before quitting
	time.sleep(30)
	driver.quit()
	display.stop()
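
The fixed time.sleep(30) in the script above is only a guess at how long Chrome needs. A rough alternative, assuming the PDF lands in download_dir with a .pdf extension, is to poll the folder until a new file shows up. The function name wait_for_new_file and its parameters are hypothetical, and keep in mind that a file can appear before Chrome has finished writing it.

```python
import os
import time


def wait_for_new_file(directory, known_files, suffix=".pdf", timeout=30.0, poll=0.5):
    """Poll `directory` until a file not in `known_files` with `suffix` appears.

    Returns the path of the new file, or None if `timeout` seconds pass
    without one showing up.
    """
    deadline = time.monotonic() + timeout
    while True:
        for name in sorted(os.listdir(directory)):
            if name not in known_files and name.endswith(suffix):
                return os.path.join(directory, name)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

To use it, record known_files = set(os.listdir(download_dir)) just before calling driver.execute_script("window.print();"), then call wait_for_new_file(download_dir, known_files) in place of the long sleep.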