In this article, I’m going to demonstrate some code snippets that you can use to download files from the Internet using Python. Before we begin, you might be wondering why you should go through the hassle of writing a script to download files when you can simply open the link in a browser and click download.

Why?

To ease repetitive tasks. For example, let’s say you are browsing a website with tons of download links and you want to download all of these files. Doing this manually will consume a lot of your time, and you will get bored or frustrated once you repeat the same clicks over and over. This is a good case for automating the work with a script instead of doing it manually.

Intro

I’m assuming you have a solid basic knowledge of Python. I’m going to use some Python libraries that are available on the Python Package Index (PyPI) and can be installed with pip. The code in this tutorial is written in Python 3 and tested on a Linux machine. Since it is written in Python, it should work on other operating systems as well.

Download Small/Large Files Using Requests

If you are downloading tiny files, you can simply use requests, Python’s most popular HTTP library. Install it using pip (pip install requests) if it is not installed already. The code will be similar to this:

import requests

filename = "logo.png"
document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
# the whole response body is held in memory before it is written to disk
with open(filename, 'wb') as f:
    f.write(requests.get(document_url).content)
Downloads a tiny file using the requests module of Python 3

In the above script, we are downloading the logo of my website and saving it in a file named logo.png. This code should work for tiny files. However, a massive file downloaded this way can eat up all your RAM, because the whole response is loaded into memory before it is written to disk. Instead, you can stream the file and consume the content in chunks, like below:

import requests

chunk_size = 4096
filename = "logo.png"
document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
# stream=True defers downloading the body until it is iterated over
with requests.get(document_url, stream=True) as r:
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
Downloads a massive file using requests
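
The chunk-writing loop above can be wrapped in a small helper. The sketch below is one possible way to do it and is not part of the original snippet: save_chunks is a hypothetical name, and it accepts any iterable of byte chunks, for example r.iter_content(chunk_size). It assumes you would rather not leave a half-written file behind when a download fails midway, so it writes to a temporary .part file first and renames it only at the end.

```python
import os


def save_chunks(chunks, filename):
    """Write an iterable of byte chunks to filename and return the bytes written.

    Hypothetical helper: data goes to "<filename>.part" first and is renamed
    only after every chunk has been written successfully.
    """
    part_name = filename + ".part"
    written = 0
    try:
        with open(part_name, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty chunks
                    f.write(chunk)
                    written += len(chunk)
    except BaseException:
        # remove the partial file if the stream failed midway
        if os.path.exists(part_name):
            os.remove(part_name)
        raise
    os.replace(part_name, filename)
    return written
```

With requests, the call would be save_chunks(r.iter_content(chunk_size), filename), placed inside the with requests.get(document_url, stream=True) as r: block.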

Sometimes a file requires additional headers or cookies. With requests, you can easily set both. The easiest way to find out which headers and cookies to use is to inspect the network request with a web browser’s developer tools. For example, in Chrome, pressing Ctrl + Shift + I opens the developer tools. In the Network tab you can inspect the headers of each network request. You can right-click on the download request and select “Copy as cURL” to copy the request, including its headers, as is.

Let’s take a look at how we can set headers and cookies using requests.

import requests

chunk_size = 4096
filename = "logo.png"
document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "Connection": "keep-alive",
}
s = requests.Session()
cookie = requests.cookies.create_cookie('COOKIE_NAME', 'COOKIE_VALUE')
s.cookies.set_cookie(cookie)
with s.get(document_url, stream=True, headers=headers) as r:
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
Downloading a file using requests with custom cookies & headers
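
A command copied with “Copy as cURL” can also be turned into a headers dictionary instead of retyping each header by hand. The sketch below is an illustration under a few assumptions: headers_from_curl is a hypothetical helper, the example.com command is made up, and the copied text is expected to be the bash flavor with plain quoted -H 'Name: value' arguments.

```python
import shlex


def headers_from_curl(curl_command):
    """Extract the -H/--header values of a copied cURL command into a dict."""
    tokens = shlex.split(curl_command)
    headers = {}
    # walk over (flag, value) pairs and keep only header arguments
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


# hypothetical command, shortened from what a browser would copy
copied = (
    "curl 'https://example.com/file.png' "
    "-H 'Connection: keep-alive' "
    "-H 'Cookie: COOKIE_NAME=COOKIE_VALUE' "
    "--compressed"
)
headers = headers_from_curl(copied)
```

The resulting dictionary can be passed as headers=headers to s.get(), exactly like the hand-written one above.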

Combine wget or cURL with Python to Download Files

In some cases, downloading a file can be quite troublesome. For example, if the server is not properly configured or serves files badly, you may get network errors, or errors such as ChunkedEncodingError or IncompleteRead. This can happen for multiple reasons, such as the content length not being defined by the server or the server closing the connection abruptly. To cope with this challenge, you can get help from robust command line downloaders such as curl or wget. Interestingly, Python has a third-party package called wget as well. Despite the name, it is a pure-Python downloader and does not call the wget binary installed on the system. The code is as simple as this:

import wget
document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
wget.download(document_url)

Of course, you will have to install wget first, i.e. pip install wget.

Alternatively, you can use pretty much any other command line downloader that is available on your system, such as curl, aria2c, or axel. For example, see the snippet below:

from subprocess import call
document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
filename = "logo.png"
call(["curl", document_url, '-H', 'Connection: keep-alive', '-H', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36', '--compressed', "--output", filename])

I’m using Python’s subprocess module to invoke a terminal command that downloads the file using an external program called curl. I’ve set two custom headers using the -H argument. The output filename/path is specified using the --output argument.
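
The call above ignores curl’s exit status, so a failed download can go unnoticed. The following is a minimal sketch of one way to tighten that up, not a definitive recipe: build_curl_command and download_with_curl are hypothetical helper names, and the choice of extra flags (--fail, --location, --retry) and the default of three retries are my own assumptions about what is useful with a flaky server.

```python
import shutil
import subprocess


def build_curl_command(url, filename, headers=None, retries=3):
    """Return the argument list for a curl download (hypothetical helper)."""
    command = [
        "curl", url,
        "--fail",                  # exit with an error on HTTP 4xx/5xx responses
        "--location",              # follow redirects
        "--retry", str(retries),   # retry transient failures
        "--compressed",
        "--output", filename,
    ]
    for name, value in (headers or {}).items():
        command += ["-H", f"{name}: {value}"]
    return command


def download_with_curl(url, filename, headers=None):
    """Run curl and raise CalledProcessError if it exits with a non-zero status."""
    if shutil.which("curl") is None:
        raise RuntimeError("curl is not installed or not on PATH")
    subprocess.run(build_curl_command(url, filename, headers), check=True)
```

Calling download_with_curl(document_url, filename, {"Connection": "keep-alive"}) would then raise an exception instead of silently leaving a broken file.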

Download Files That Require Authentication Using Python Selenium & requests

Files that require authentication and dynamic interaction from the user, such as clicking a button to submit a complex form, can be very tricky to download with the tools mentioned above. However, we can combine Selenium and requests to achieve it. The trick is to authenticate and do all the interaction using Selenium with a webdriver, for example chromedriver. Then copy the session cookies from the driver, set them on a requests session, and finally download the file. Sounds complicated? Look at the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests

username = "Your Username"
password = "Your Password"
driver = webdriver.Chrome()
# authenticate using username, password
login_url = "https://your.target_website.com/login/"
driver.get(login_url)
driver.find_element(By.ID, "username").send_keys(username)
driver.find_element(By.ID, "password").send_keys(password + "\n")
# interact with target web elements to submit a form
# target_url is a placeholder for the page that contains the form
target_url = "https://your.target_website.com/page_with_form/"
driver.get(target_url)
driver.find_element(By.ID, "some_button").click()
# retrieve download url
download_url = driver.find_element(By.ID, "div_with_download_link").find_element(By.TAG_NAME, "a").get_attribute("href")
chunk_size = 4096
download_filename = "logo.png"
# your custom headers
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "Connection": "keep-alive",
}
s = requests.Session()
# retrieve and set cookies from selenium to requests session
for selenium_cookie in driver.get_cookies():
    cookie = requests.cookies.create_cookie(selenium_cookie['name'], selenium_cookie['value'])
    s.cookies.set_cookie(cookie)
with s.get(download_url, stream=True, headers=headers) as r:
    with open(download_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
driver.quit()
Selenium + chromedriver + requests combo for downloading a file
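
The loop above copies only the name and value of each cookie, so the requests session will send every cookie to any host it talks to. If you would like to keep the domain, path, and expiry that the browser recorded, one option is to translate them as well. The sketch below assumes the usual shape of the dictionaries returned by driver.get_cookies(); cookie_kwargs is a hypothetical helper and the sample cookie is made up.

```python
def cookie_kwargs(selenium_cookie):
    """Map one dict from driver.get_cookies() to create_cookie() keyword arguments."""
    kwargs = {
        "name": selenium_cookie["name"],
        "value": selenium_cookie["value"],
        "domain": selenium_cookie.get("domain", ""),
        "path": selenium_cookie.get("path", "/"),
        "secure": selenium_cookie.get("secure", False),
    }
    # session cookies have no expiry, so only copy it when present
    if "expiry" in selenium_cookie:
        kwargs["expires"] = selenium_cookie["expiry"]
    return kwargs


# made-up cookie in the shape Selenium returns
sample = {"name": "sessionid", "value": "abc123", "domain": ".example.com",
          "path": "/", "secure": True, "httpOnly": True, "expiry": 1700000000}
kwargs = cookie_kwargs(sample)
```

Inside the loop, the cookie would then be created with requests.cookies.create_cookie(**cookie_kwargs(selenium_cookie)) before it is passed to s.cookies.set_cookie().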

If you still don’t understand, just leave a comment and I will try to help you understand how it works.

BONUS Trick: Downloading a PDF File Using a JavaScript, Selenium, & Python Combination

The snippet below downloads a PDF file using the browser’s Print option. In this script, we automate a Chrome browser with chromedriver and save the page as a PDF by executing the window.print() JavaScript command. You will first have to install selenium & pyvirtualdisplay. Note that pyvirtualdisplay also needs Xvfb to be installed on the system.

pip install selenium pyvirtualdisplay

I’ve added some comments in the code to clarify the process in detail.

import json
import time

from pyvirtualdisplay import Display
from selenium import webdriver

document_url = "https://www.adobe.com/content/dam/acom/en/accessibility/products/acrobat/pdfs/acrobat-x-accessibility-checker.pdf"
download_dir = "/path/to/dir/"
# setup a virtual display using pyvirtualdisplay
display = Display(visible=0, size=(1768, 1368))
display.start()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--disable-popup-blocking")
chrome_options.add_argument("--disable-logging")
chrome_options.add_argument("--log-level=3")
# chrome option settings to enable automatic download in the specified folder without showing Save As PDF dialog box.
chrome_options.add_argument("--kiosk-printing")
appState = {
    "recentDestinations": [{"id": "Save as PDF", "origin": "local"}],
    "selectedDestinationId": "Save as PDF",
    "version": 2,
}
prefs = {
    "printing.print_preview_sticky_settings.appState": json.dumps(appState),
    "download": {
        "default_directory": download_dir,
        "prompt_for_download": False,
        "directory_upgrade": True,
    },
}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
driver.get(document_url)
# give the page time to load before printing
time.sleep(10)
driver.execute_script("window.print();")
# give Chrome time to write the PDF before closing the browser
time.sleep(30)
driver.quit()
display.stop()
Download a PDF by using JavaScript’s window.print() with Selenium
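
The fixed time.sleep(30) in the script waits the full 30 seconds even when the file is saved sooner, and it may still be too short for a large document. One alternative is to poll the download folder until the file shows up. The sketch below is only an illustration: wait_for_download is a hypothetical helper, and it assumes the PDF lands in download_dir with a .pdf extension.

```python
import os
import time


def wait_for_download(directory, suffix=".pdf", timeout=30.0, poll_interval=0.5):
    """Return the path of the first finished file ending in suffix, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        names = sorted(os.listdir(directory))
        # Chrome keeps unfinished regular downloads under a temporary .crdownload name
        in_progress = any(name.endswith(".crdownload") for name in names)
        finished = [name for name in names if name.endswith(suffix)]
        if finished and not in_progress:
            return os.path.join(directory, finished[0])
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

In the script, pdf_path = wait_for_download(download_dir) could take the place of the second time.sleep() call, with a None result treated as a failed download.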

If you still have any questions, feel free to ask!
