Web Scraper to ‘one-click’ download PDFs on a website

Avocado Aun
4 min read · Apr 21, 2020


Because lazy people will always find a smart way to do it

Why do we scrape the web?

Websites are presented to us the way their developers wanted them to be seen. That’s why you see exactly the same page design and content when you browse to the same URL on Chrome or Safari or … (a few moments later) Internet Explorer. In the context of web scraping, there are times when we only want certain information or data from a website, and we need it fast. A classic example is extracting price information from Amazon.com every 1–2 days to track the price changes of a listed product (shopaholic approved!).

How does a web scraper work?

A scraper (sometimes called a ‘spider’) is a ‘bot’ that automatically selects elements on a website to be viewed or exported. Think of it as having an intern click through and download every element on the website: the banners, the headers, any clickable text, links and pictures. Now replace that intern with your scraper; no coffee needed! Here’s an insightful video that explains the logic behind web scraping.
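The ‘intern clicking every link’ idea can be sketched in a few lines of Python using only the standard library’s `html.parser`. The page HTML below is a made-up example, not the real site:

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collects the href of every <a> tag, like an intern clicking each link."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A toy page standing in for a real website
page = """
<html><body>
  <a href="/exams/2019-final.pdf">2019 Final</a>
  <img src="banner.png">
  <a href="/exams/2020-final.pdf">2020 Final</a>
</body></html>
"""

scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # every link the 'intern' would have clicked
```

Real-world scrapers add an HTTP client on top of this (to fetch the page) and a parser library like BeautifulSoup or lxml, but the core loop is the same: walk the markup, pick out the elements you care about.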

Does it always work?

The short answer is yes and no. OK, scraping looks very similar to something you see in a spy movie, but in fact it is very heavily ‘hard-coded’. A scraper built for Amazon.com would not work out of the box to scrape Apple.com. Normally a scraper is designed to work on just one website, for a few reasons: (1) websites are coded in different web languages, so they are syntactically different; (2) websites are built by different developers with different styles of ‘writing’; (3) websites are updated rather frequently, resulting in changes to the code structure and syntax. A ‘general-purpose’ crawler might work on more websites, but it is not useful for extracting fine-grained information. Likewise, a ‘special-built’ crawler can extract more information, but it only works for that specific website. Just as blue teams and red teams are in an endless rat race to counter each other in information security, web scrapists (is that even a word?) are always updating their crawlers to accommodate changes made by web developers.

How to scrape the web?

For the laziest among lazy people, check out parsehub.com. This story is not an ad, but if you’re into website scraping and do not want to get your hands dirty, then this is one of the go-to tools. Parsehub is just another extra tab in your browser, and you can feel the power of scraping with a few clicks of a button.

For those feeling adventurous, there are many good starter guides for getting into the game of scraping with Python, like this one here.

For data scientists, or if you think you are one, try Scrapy. Scrapy is one of the most popular scraping tools for data collection in a machine-learning pipeline.

For this story, we will demonstrate a Python script that uses pywinauto to ‘crawl’ a university website and automatically download all the PDFs found on the webpage.
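The real program drives the browser with pywinauto, but the underlying idea is simple: find every link ending in `.pdf`, resolve it against the page’s URL, and fetch each one. A standard-library sketch of that idea follows; the course page and URL here are hypothetical, and the actual site sits behind a login, so the download step is left commented out:

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve

class PdfLinkFinder(HTMLParser):
    """Collects only the hrefs that point at PDF files."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self.pdf_links.append(href)

def find_pdfs(html, base_url):
    """Returns absolute URLs of every PDF linked from the page."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, link) for link in finder.pdf_links]

# Hypothetical course page; the real site requires a login session.
html = '<a href="papers/cs101-2019.pdf">2019</a> <a href="/syllabus.html">Syllabus</a>'
urls = find_pdfs(html, "https://example.edu/course/cs101/")
print(urls)

# Download step (commented out so the sketch stays offline):
# for url in urls:
#     urlretrieve(url, posixpath.basename(url))
```

pywinauto replaces the `urlretrieve` step with actual clicks on the browser window, which is what lets the program reuse your logged-in session instead of re-implementing authentication.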

The challenge

For students who work better under pressure, it turns out there is an easy way to ‘download ALL the past years’ exam papers in one click’ in your last-minute quest to make your ASEAN mum proud. All you have to do is key in your login credentials, navigate to the course page you want the papers from, and then…BOOM! Here’s a short preview of what the final program can do for you.

The Code

In this section, we highlight some important parts of the code that we think you should know so that you can customise the scraper to crawl other websites. To stay ‘true’ to our lazy-people mantra, you can jump to the last section of this story to get the pastyearpaperDownloadBot.exe program. Although you’re required to key in your credentials to use the program, going through the code should convince you that nothing fishy is happening here.

Section by section explanation to be added soon

The Executable

Congratulations you’ve found the automatic downloader program. It’s here.

If you made it here after completing all the sections, here’s a badge for you: A for . If you’ve chosen the road less taken and skipped straight here, congratulations, you are one of those lazy people who always find a smart way to get things done.

Steps to use the program:

1. Start the scraper

2. When prompted by Microsoft Defender SmartScreen, click ‘More Info’ and ‘Run Anyway’

3. Log in with your student ID and password (we do not ‘remember’ your logins). If you have trust issues with .exe files, you can compile and execute the code instead.

4. Let the magic begin. The papers will start downloading automatically.

5. Press CTRL+C for early exit
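Step 5 works because pressing CTRL+C raises `KeyboardInterrupt` in a running Python program. A minimal sketch of how a download loop can catch it and exit cleanly; the paper names and `fetch` callback here are made up for illustration:

```python
def download_all(papers, fetch):
    """Downloads each paper in turn; CTRL+C stops early but cleanly."""
    done = []
    try:
        for paper in papers:
            fetch(paper)
            done.append(paper)
    except KeyboardInterrupt:
        print(f"Stopped early after {len(done)} of {len(papers)} papers.")
    return done

# Simulated run: the second fetch raises as if CTRL+C were pressed mid-download.
def fake_fetch(paper):
    if paper == "2020.pdf":
        raise KeyboardInterrupt
    print("downloaded", paper)

finished = download_all(["2019.pdf", "2020.pdf", "2021.pdf"], fake_fetch)
print(finished)
```

Catching the interrupt this way means a half-finished run still reports what it managed to download instead of dying with a traceback.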

p.s. The auto-login functionality is not active yet; only the auto-downloader works for now. Special thanks to Jin Zhang for coding the scraper. He is one of my favourite FYP students and is currently working as a tech developer with Huawei Malaysia.
