In this article, I’m going to demonstrate some code snippets that you can use to download files from the Internet using Python. Before we begin, you might be wondering: why go through all the hassle of writing scripts to download files when you can simply open them in a browser and click download?
Why?
To ease repetitive tasks. For example, let’s say you are browsing a website with tons of download links and you want to download all of those files. Doing this manually would consume a lot of your time, and you would get bored or frustrated clicking the same things over and over. This is a good case for automating the task with a script instead of doing it by hand.
Intro
I’m assuming you have a solid working knowledge of Python. I’m going to use some Python libraries that are available on the Python Package Index and installable with pip. The code in this tutorial is written in Python 3 and tested on a Linux machine, but since it is plain Python, it should work on other operating systems as well.
Download Small/Large Files Using Requests
If you are downloading tiny files, you can simply use Python’s most popular HTTP library, requests. Install it with pip if it isn’t installed already. The code will look similar to this:
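(A minimal sketch: the URL points to the logo mentioned below, and logo.png is the output filename.)

import requests

document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"

response = requests.get(document_url)
response.raise_for_status()  # fail early on HTTP errors

# write the downloaded bytes to a local file
with open("logo.png", "wb") as output_file:
    output_file.write(response.content)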
In the above script, we are downloading the logo of my website and saving it in a file named logo.png.
This code works fine for tiny files. However, if you try to download a massive file this way, it can eat up all your RAM, because the whole response is held in memory. Instead, you can stream the file and consume the content in chunks, like below:
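(A sketch of the streaming approach, using the same URL as before; the 8192-byte chunk size is just a reasonable default.)

import requests

document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"

# stream=True tells requests not to load the whole body into memory at once
with requests.get(document_url, stream=True) as response:
    response.raise_for_status()
    with open("logo.png", "wb") as output_file:
        for chunk in response.iter_content(chunk_size=8192):
            output_file.write(chunk)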
Sometimes files might require additional headers or cookies, and requests makes it easy to set both. The easiest way to find out which headers and cookies to use is to inspect the network request with a web browser’s developer tools. For example, in Chrome, pressing Ctrl + Shift + I opens the DevTools; in the Network tab you can inspect the request headers, and you can right-click on the download request and select “Copy as cURL” to copy the headers as is.
Let’s take a look at how we can set headers and cookies with requests:
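(A sketch; the header values mirror the cURL example later in this article, and the cookie name and value are placeholders you would replace with whatever your browser’s DevTools show.)

import requests

document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"

# values copied from the browser's network tab; replace them with your own
headers = {
    "Connection": "keep-alive",
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
    ),
}
cookies = {"sessionid": "replace-with-your-cookie-value"}  # placeholder cookie

response = requests.get(document_url, headers=headers, cookies=cookies)
response.raise_for_status()
with open("logo.png", "wb") as output_file:
    output_file.write(response.content)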
Combine wget or cURL with Python to download files
In some cases, downloading files can be quite troublesome. For example, if the server is not properly configured or serves the file badly, you may get network errors such as ChunkedEncodingError or IncompleteRead. This can happen for multiple reasons: the server does not define the content length, the server closes the connection abruptly, and so on. To cope with this, you can take help from robust command-line downloaders such as curl or wget. Interestingly, there is a Python package called wget (a small, pure-Python download utility modeled on the command-line tool) that keeps the code as simple as this:
import wget

document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
# saves the file in the current directory, with a name derived from the URL
wget.download(document_url)
Of course, you will have to install the wget package first, i.e. pip install wget.
Alternatively, you can use pretty much any other command-line downloader available on your system, such as curl, aria2c, or axel. For example, see the snippet below:
from subprocess import call

document_url = "https://wasi0013.files.wordpress.com/2018/11/my_website_logo_half_circle_green-e1546027650125.png"
filename = "logo.png"

# invoke curl with two custom headers and an explicit output filename
call([
    "curl", document_url,
    "-H", "Connection: keep-alive",
    "-H", "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "--compressed",
    "--output", filename,
])
I’m using Python’s subprocess module to invoke a terminal command that downloads the file with an external program, curl. I’ve set two custom headers using the -H argument, and the output filename/path is specified with the --output argument.
Download files that require authentication using Python Selenium & requests
Files that require authentication or dynamic interaction from the user, such as clicking buttons to submit complex forms, can be very tricky to download with the tools mentioned above. However, we can easily combine Selenium for Python with requests to achieve it. The trick is to authenticate and do all the interaction with Selenium and a WebDriver (chromedriver, for example), then copy the cookies from the driver into a requests session and finally download the file with that session. Sounds complicated? Look at the code below:
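(A sketch of the idea; the login URL, the file URL, and the login steps are placeholders you would replace with your own site and interaction code.)

import requests
from selenium import webdriver

LOGIN_URL = "https://example.com/login"            # placeholder: your site's login page
FILE_URL = "https://example.com/files/report.pdf"  # placeholder: the protected file

driver = webdriver.Chrome()
driver.get(LOGIN_URL)
# ... fill in the form fields / click the buttons with the usual Selenium calls here ...

# copy the authenticated cookies from the browser into a requests session
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))

# reuse the browser's User-Agent so the server sees a consistent client
session.headers.update({"User-Agent": driver.execute_script("return navigator.userAgent;")})
driver.quit()

# download the protected file through the authenticated session, streaming it in chunks
with session.get(FILE_URL, stream=True) as response:
    response.raise_for_status()
    with open("report.pdf", "wb") as output_file:
        for chunk in response.iter_content(chunk_size=8192):
            output_file.write(chunk)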
If you still don’t understand, just leave a comment and I will try to help you understand how it works.
BONUS Trick: Downloading a PDF file using a JavaScript, Selenium & Python combination
The snippet below downloads a PDF file using the browser’s print option. In this script, we automate a Chrome browser with chromedriver and have it save a page as a PDF by executing the window.print() JavaScript command. You will have to install selenium & pyvirtualdisplay first:
pip install selenium pyvirtualdisplay
I’ve added some comments in the code to clarify the process in detail.
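Here is a minimal sketch of how this can be wired up on Linux. Assumptions: page_url and download_dir are placeholders, the Chrome preferences below set the print destination to “Save as PDF”, and the --kiosk-printing flag makes window.print() save silently instead of opening the print dialog; pyvirtualdisplay provides a virtual screen because this trick needs a non-headless browser.

import json
import time

from pyvirtualdisplay import Display
from selenium import webdriver

page_url = "https://wasi0013.com/"  # placeholder: the page you want saved as a PDF
download_dir = "/tmp"               # placeholder: where Chrome should save the PDF

# start a virtual display; kiosk printing needs a real (non-headless) browser window
display = Display(visible=0, size=(1366, 768))
display.start()

# tell Chrome's print preview to use "Save as PDF" and to save into download_dir
app_state = {
    "recentDestinations": [{"id": "Save as PDF", "origin": "local", "account": ""}],
    "selectedDestinationId": "Save as PDF",
    "version": 2,
}
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "printing.print_preview_sticky_settings.appState": json.dumps(app_state),
    "savefile.default_directory": download_dir,
})
options.add_argument("--kiosk-printing")  # print immediately without showing the dialog

driver = webdriver.Chrome(options=options)
driver.get(page_url)

# trigger the browser's print option; with the settings above this writes a PDF silently
driver.execute_script("window.print()")
time.sleep(5)  # give Chrome a moment to finish writing the file

driver.quit()
display.stop()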
If you still have any questions, feel free to ask!