In this article, I’m going to demonstrate some code snippets that you can use to download files from the Internet using Python. I’m assuming you have a solid basic knowledge of Python. I’m going to use some Python libraries that are available on the Python Package Index (PyPI) and can be installed with pip. The code in this tutorial is written in Python 3 and was tested on a Linux machine. Since it is written in Python, it should work on other operating systems as well.

Download Small/Large Files Using Requests

If you are downloading tiny files, you can simply use Python’s most popular HTTP library, requests. Install it using pip (pip install requests) if it is not installed already. The code will be similar to this:

downloads a tiny file using requests module of python 3
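The embedded snippet did not survive in this copy of the post, so here is a minimal sketch of what such a script might look like. The function name download_small_file, the timeout value, and the example.com URL are assumptions made for illustration, not the original code.

```python
import requests


def download_small_file(url, filename, timeout=30):
    """Fetch `url` in a single request and write the body to `filename`."""
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of saving an error page as the file.
    response.raise_for_status()
    # response.content holds the whole body in memory, so this suits small files only.
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename


# Usage (placeholder URL, replace it with the real file location):
# download_small_file("https://example.com/logo.png", "logo.png")
```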

In the above script, we are downloading the logo of my website and saving it in a file named logo.png. This code works for tiny files because the whole response is held in memory. A massive file downloaded this way can eat up all your RAM. Instead, you can stream the response and consume the content in chunks, like below:

download massive file using requests
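As a rough sketch of the streaming approach (the original snippet is missing here), the function below passes stream=True and writes the body chunk by chunk. The name download_large_file, the 8192-byte chunk size, and the timeout are assumptions chosen for illustration.

```python
import requests


def download_large_file(url, filename, chunk_size=8192, timeout=30):
    """Stream `url` to `filename` chunk by chunk to keep memory use low."""
    # stream=True defers downloading the body until we iterate over it.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty chunks
                    f.write(chunk)
    return filename
```

Only one chunk is held in memory at a time, so the file size is limited by disk space rather than RAM.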

Sometimes a file requires additional headers or cookies before the server will hand it over. With requests, you can easily set headers and cookies. The easiest way to find out which headers and cookies to use is to inspect the network request with a web browser’s developer tools. For example, in Chrome, press Ctrl + Shift + I on a page to open the developer tools. In the Network tab you can inspect the headers of each network request. You can right-click the download request and select “Copy as cURL” to copy the headers as they are.

Let’s take a look at how we can set headers and cookies using requests.

downloading file using requests with custom cookies & headers
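Since the original snippet is not included here, the following is a minimal sketch of passing headers and cookies to requests. The function name and the header and cookie values in the usage comment are hypothetical; in practice you would copy the real ones from the browser’s Network tab.

```python
import requests


def download_with_headers(url, filename, headers=None, cookies=None, timeout=30):
    """Stream `url` to `filename`, sending extra request headers and cookies."""
    with requests.get(
        url, headers=headers, cookies=cookies, stream=True, timeout=timeout
    ) as response:
        response.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    return filename


# Usage with made-up values (copy the real ones from your browser):
# download_with_headers(
#     "https://example.com/report.pdf",
#     "report.pdf",
#     headers={"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"},
#     cookies={"sessionid": "value-copied-from-the-browser"},
# )
```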

Combine wget or cURL with Python to download files

In some cases, downloading a file can be quite troublesome. For example, if the server is not properly configured or serves the file badly, you may get a network error or an exception such as ChunkedEncodingError or IncompleteRead. This can happen for several reasons: the server may not send a content length, or it may close the connection abruptly. To cope with this, you can get help from robust command-line downloaders such as curl or wget. Python also has a package called wget on PyPI. Despite the name, it is a pure-Python download utility and does not need the wget program to be installed on the system. The code is as simple as this:

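import wget
document_url = ""  # set this to the URL of the file you want to download
wget.download(document_url)  # saves the file in the current directory and returns its filename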

Of course, you will have to install the wget package first, i.e. pip install wget.

Alternatively, you can use pretty much any other command-line downloader that is available on your system, such as curl, aria2c, or axel. For example, see the snippet below:

from subprocess import run
document_url = ""  # set this to the URL of the file you want to download
filename = "logo.png"
# -H adds a request header; --output sets the path of the saved file
run(["curl", document_url, '-H', 'Connection: keep-alive', '-H', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36', '--compressed', "--output", filename], check=True)

I’m using Python’s subprocess module to invoke a terminal command that downloads the file with an external program called curl. I’ve set two custom headers using the -H argument. The output filename or path is specified using the --output argument.
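If you need this more than once, the curl call can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, and the --fail flag is my addition so that curl exits with a non-zero status on HTTP errors, which lets check=True raise an exception.

```python
import shutil
import subprocess


def build_curl_command(url, filename, headers=None):
    """Return the argument list for a curl download of `url` into `filename`."""
    command = ["curl", url, "--fail", "--compressed", "--output", filename]
    # Each header becomes its own "-H 'Name: value'" pair.
    for name, value in (headers or {}).items():
        command += ["-H", f"{name}: {value}"]
    return command


def download_with_curl(url, filename, headers=None):
    """Run curl and raise CalledProcessError if it exits with a non-zero status."""
    if shutil.which("curl") is None:
        raise RuntimeError("curl was not found on PATH")
    subprocess.run(build_curl_command(url, filename, headers), check=True)
    return filename
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when a URL or header value contains spaces or special characters.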

Download files that require authentication using Python Selenium & requests

Files that require authentication and dynamic interaction from the user, such as clicking a button to submit a complex form, can be very tricky to download with the tools mentioned above. However, we can easily combine Selenium for Python and requests to achieve it. The trick is to authenticate and do all the interaction using Selenium with a webdriver, for example chromedriver, then copy the session cookies from the driver into a requests session, and finally download the file with that session. Sounds complicated? Look at the code below:

selenium + chromedriver + requests combo for downloading file
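The original snippet is not reproduced in this copy, so here is a hedged sketch of the idea. The helper session_from_driver copies cookies and the user agent from any Selenium driver into a requests session. The login flow in download_after_login is purely illustrative: the form field names, the submit button selector, and the assumption that the site redirects away from the login page after a successful login are all hypothetical and must be adapted to the real site.

```python
import requests


def session_from_driver(driver):
    """Build a requests.Session carrying the cookies and user agent of a Selenium driver."""
    session = requests.Session()
    # Reuse the browser's user agent so the server sees a consistent client.
    session.headers["User-Agent"] = driver.execute_script("return navigator.userAgent;")
    for cookie in driver.get_cookies():
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def download_after_login(login_url, file_url, filename, username, password):
    """Log in with Chrome, then download `file_url` with requests."""
    # Imported here so session_from_driver stays usable without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        # Hypothetical form field names; inspect the real page to find yours.
        driver.find_element(By.NAME, "username").send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        # Assumes the site navigates away from the login page on success.
        WebDriverWait(driver, 15).until(lambda d: d.current_url != login_url)
        session = session_from_driver(driver)
    finally:
        driver.quit()

    with session.get(file_url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    return filename
```

The browser is closed before the download starts because, once the cookies are copied, the requests session no longer needs it.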

If you still don’t understand, just leave a comment and I will try to help you understand how it works.

BONUS Trick: Downloading a PDF file using a JavaScript, Selenium & Python combination

The snippet below downloads a PDF file using the browser’s Print option. In this script, we automate a Chrome browser with chromedriver and save a PDF by executing the window.print() JavaScript command. You will first have to install selenium and pyvirtualdisplay.

pip install selenium pyvirtualdisplay

I’ve added some comments in the code to clarify the process in detail.

download pdf by using javascript’s window.print() with selenium
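The original script is not included in this copy. The sketch below shows one commonly used way to make window.print() save a PDF without a dialog. It is an assumption on my part, not necessarily what the original code did: it preselects the “Save as PDF” destination through Chrome preferences and starts Chrome with --kiosk-printing. The function names and the polling loop are hypothetical, and the preference keys may behave differently across Chrome versions.

```python
import json
import os
import time


def build_print_prefs(download_dir):
    """Chrome preferences that preselect "Save as PDF" for printing."""
    app_state = {
        "recentDestinations": [{"id": "Save as PDF", "origin": "local", "account": ""}],
        "selectedDestinationId": "Save as PDF",
        "version": 2,
    }
    return {
        "printing.print_preview_sticky_settings.appState": json.dumps(app_state),
        "savefile.default_directory": download_dir,
    }


def save_page_as_pdf(url, download_dir, wait_seconds=30):
    """Open `url` in Chrome on a virtual display and print it to a PDF file."""
    # Imported here so build_print_prefs can be used without these packages.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    os.makedirs(download_dir, exist_ok=True)
    # A virtual display lets a regular (non-headless) Chrome run on a server.
    display = Display(visible=0, size=(1280, 1024))
    display.start()
    try:
        options = webdriver.ChromeOptions()
        options.add_experimental_option("prefs", build_print_prefs(download_dir))
        # Print immediately with the preselected destination, without a dialog.
        options.add_argument("--kiosk-printing")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            driver.execute_script("window.print();")
            # Poll until a PDF shows up in the target directory.
            deadline = time.time() + wait_seconds
            while time.time() < deadline:
                pdfs = [n for n in os.listdir(download_dir) if n.endswith(".pdf")]
                if pdfs:
                    return os.path.join(download_dir, pdfs[0])
                time.sleep(0.5)
            raise TimeoutError("no PDF appeared in the download directory")
        finally:
            driver.quit()
    finally:
        display.stop()
```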

If you still have any questions, feel free to ask!
