Deploying a Python web scraper on Google Cloud Platform

 

In this article, I'll show you how I built my first web scraper, deployed it as a Cloud Function, and made it run every 10 minutes.

Photo by C Dustin on Unsplash
This article consists of three main parts. In the first part, I'll quickly show you how I built the scraper itself, but I won't go into detail, as there are already a lot of resources on this topic. In the second part, I'll show you how to set up a Cloud Function in Google Cloud Platform. And in the third part, I'll show you how to make it repeat every 10 minutes.

Web scraper

I chose Python and a library called BeautifulSoup4. Based on my quick research, the library is very popular, which usually means less headache later on. I was surprised at how easy it was to learn (I had zero experience with it).
As I said before, I won’t go into detail, as there are lots of very good tutorials online. The official docs are also great.

Basics

I used BeautifulSoup4 and the "requests" library. With requests, I first accessed the website I'm scraping, then fed the raw response to BeautifulSoup4, which let me dig through the page in Python.
In line 1, I send a GET request to the provided URL, then in line 2, I get the whole page document as a string. The "windows-1250" part is the page encoding. This is usually "UTF-8", but the page I scraped used this encoding. In the last step, we feed the string to a BeautifulSoup constructor, and that gives us the "soup" object.
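In code, those three steps look roughly like this (the URL here is just a placeholder):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/results")  # line 1: GET request to the page
page = response.content.decode("windows-1250")          # line 2: the whole document as a string, in the page's encoding
soup = BeautifulSoup(page, "html.parser")               # last step: the "soup" object we'll dig through
```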
Now we’re ready for digging.

Scraping

In my case, the page I was scraping displayed results in rows. All the results were structured the same, so it was very easy to pull the data out.
First, I got all the result rows by looking for "div"s with the class "result-row". Then, for every row, I looked for a "span" with the class "result-title" and saved its value to a list.
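Roughly, that step looks like this (the class names are the ones from the page I was scraping; yours will differ):

```python
# Collect the title of every result row on the page.
titles = []
for row in soup.find_all("div", class_="result-row"):
    title_span = row.find("span", class_="result-title")
    if title_span is not None:
        titles.append(title_span.get_text(strip=True))
```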

Saving data

At the end, I saved all this data to a Firestore database.
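A minimal sketch of that step with the google-cloud-firestore client (the collection name "titles" is just my choice):

```python
from google.cloud import firestore

# The client picks up the default GCP credentials of the environment it runs in.
db = firestore.Client()

# Store each scraped title as its own document in a "titles" collection.
for title in titles:
    db.collection("titles").add({"title": title})
```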
Good, now I have the titles in my database. But there is a problem: new titles keep coming in at a steady rate, and I want my database to be as up to date as possible.

Automating it

If I had the time, I would run this script by hand every hour or two, and that would be it. But I don't have the time, and besides, I want to learn something new.
At first, I thought about setting up a cron job on my computer that runs the Python script every couple of minutes, but I quickly ruled that out: I don't want random scripts running on my PC, and it wouldn't work at night, when my PC is turned off.
I decided that I would run the script somewhere in the cloud, where it could live and repeat itself indefinitely.

Google Cloud Platform

I chose GCP because I've used it before to host websites and databases, though I'd never done anything like this with it. After some research, I had a good idea of how the architecture would need to look.
Cloud Function
The whole script I wrote for my PC would be put into a Cloud Function. A Cloud Function is basically a piece of code that lives somewhere on Google's servers: it waits for a trigger, does its thing and quits. I was really surprised at how easy it was to set up. I wrapped my code into a function and uploaded it, and that was it.

Creating a function

First, you need to go to the Google Cloud Console and choose "Cloud Functions" in the main menu. When you click "create a function", a form appears.
Form step one
In step one, you name the function and set its memory. I chose 256 MiB, as my function was fairly simple, but I'd recommend experimenting with this yourself. Then again, running these functions is very cheap, so when in doubt, go a bit higher.
Form step two
The second step is to set the trigger, so the function knows when to run. The function could listen for an HTTP request or something else, but I chose the Cloud Pub/Sub mechanism, as it seemed the easiest. I'll quickly explain how that works.
Cloud Pub/Sub
This is a mechanism used for communication and triggering between different components in GCP. First, you create a Pub/Sub topic, which acts something like an event: some components can trigger (publish to) it, and other components can be subscribed to it.
Pub/Sub topic
Just search for "Pub/Sub" in the search bar, click "Create topic", name it and save it. I created a topic named "update-db", which I'll use for triggering the function.
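To get a feel for the mechanism, this is roughly how any component could trigger the topic from Python using the google-cloud-pubsub client (the project ID is a placeholder; in our setup, Cloud Scheduler will do the publishing for us):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "update-db")  # project ID is a placeholder

# Publish a message; every subscriber of "update-db" gets triggered.
future = publisher.publish(topic_path, b"update")
print(future.result())  # the message ID, once the publish succeeds
```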

Let’s continue creating the function

Function trigger
Now that we’ve created the Pub/Sub topic, we select it in the dropdown and continue.

Uploading the source code

Uploading source code
As you can see, you could write the code in the inline editor in the browser itself, but I'd suggest zipping it up first and uploading the archive.
You also need to choose a runtime. These functions can be written in lots of different languages, but mine is written in Python 3.8, so I chose that option.
As you can see, there is a "Function to execute" field at the bottom. You need to wrap your script in a function with that exact name, so Google knows what to run.
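For a Pub/Sub-triggered function in Python, the wrapper takes two arguments, event and context. A minimal sketch (the names update_db and scrape_and_save are just my own; the scraping and Firestore code from the first part goes inside the helper):

```python
def scrape_and_save():
    # The scraping + Firestore code from the first part goes here.
    ...

def update_db(event, context):
    """Entry point, i.e. the "Function to execute".

    `event` carries the Pub/Sub message and `context` its metadata;
    neither is needed here, the function simply re-runs the scraper.
    """
    scrape_and_save()
```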
One more thing
You need to add a "requirements.txt" file to the ZIP archive, where you list all the "pip" dependencies your function uses.
requirements.txt
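For this scraper, the file just lists the three libraries used above (pin versions if you want reproducible deploys):

```
requests
beautifulsoup4
google-cloud-firestore
```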
That should be it for the function part.

Triggering the function every 10 minutes

Remember the Pub/Sub topic we created before and subscribed to in the "Trigger" section of the function? Now we need to actually trigger it somehow, preferably once every 10 minutes.

Cloud Scheduler

For that purpose, I created a Cloud Scheduler job. Find Cloud Scheduler in the main menu and click "Create a job".
Scheduler job creation form
You can name it anything you want, then add a description.
For the frequency, you must provide a string in the "unix-cron" format. It's an interesting topic in itself, but for now I'll just show you my case: the string is "*/10 * * * *", which means "at every 10th minute of every hour, every day".
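For reference, the five fields of the cron format read like this:

```
┌───────────── minute        (*/10 = every 10th minute)
│ ┌─────────── hour          (* = every hour)
│ │ ┌───────── day of month  (* = every day)
│ │ │ ┌─────── month         (* = every month)
│ │ │ │ ┌───── day of week   (* = every day of the week)
│ │ │ │ │
*/10 * * * *
```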
There’s also this great website which helps you with that. https://crontab.guru/every-10-minutes
Then choose your timezone; you can just type in your country and it will help you find it.
Then write in the name of the Pub/Sub topic that you want to trigger: copy the topic name that you subscribed to in the function. The payload doesn't serve any purpose in this case; it's used to pass some data along with the trigger, so you can put anything you want in here.

That’s it!

By now, the Cloud Scheduler should be triggering the Pub/Sub topic, which sends a pulse to our Cloud Function, and our database gets nicely updated every 10 minutes.
The techniques used in this article could be applied to many other problems. I'm happy with the result, and I've learned something new, which is awesome.
Thank you for reading my first Medium article!