I need to keep track of something on a web page that I don't control. Rather than checking the page manually on an irregular schedule, my tendency is to automate the check and notify myself. I built something similar to notify me of cryptocurrency arbitrage opportunities during the boom of 2017. Sadly, all of the minuscule profit and my significant initial investment were wiped out when the broker/exchange I used went bankrupt. However, the development lessons from those days are one of the reasons you're seeing this post today.

My initial inclination is always to build a web scraper by hand. However, during my random explorations on the web I discovered that there is a healthy ecosystem of web scraper APIs with somewhat generous free tiers. The usage is also fairly simple: you provide the URL of a public page along with a CSS selector, call the web service, and out comes a neat JSON response containing the text matched by the selector.

This took a few lines of Python code that use the requests library to make the call and return the result. The next step is to build the IFTTT notification.
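As a rough sketch of what such a call can look like (not the exact script from this post): the endpoint, the query parameter names (url, selector, api_key) and the {"text": ...} response shape below are all hypothetical, so substitute whatever your scraper API documents. The sketch uses urllib from the standard library so it runs without installing anything; with requests, the same call would be a requests.get with these values passed as params.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the one your scraper API provides.
SCRAPER_ENDPOINT = "https://scraper.example.com/v1/extract"


def build_scrape_url(page_url: str, selector: str, api_key: str) -> str:
    """Encode the target page and CSS selector as query parameters."""
    # The parameter names here are assumptions, not a real API's contract.
    query = urllib.parse.urlencode(
        {"url": page_url, "selector": selector, "api_key": api_key}
    )
    return f"{SCRAPER_ENDPOINT}?{query}"


def extract_text(payload: dict) -> str:
    """Pull the selector text out of the JSON response (assumed shape)."""
    return payload["text"].strip()


def scrape_text(page_url: str, selector: str, api_key: str, timeout: float = 30.0) -> str:
    """Call the scraper API and return the text matched by the selector."""
    url = build_scrape_url(page_url, selector, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_text(json.load(response))
```

Keeping the URL building and the response parsing in their own small functions makes it easy to check them without hitting the network or using up the free tier.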

I opened the IFTTT app on my phone but had forgotten some of its concepts and constructs, or perhaps it was just unintuitive for my brain to understand how it works now. After playing around a bit I rediscovered what I had done before. Basically, the IFTTT part where you configure the trigger and the action is called an Applet. For my purposes, the trigger was a webhook, an endpoint to which I make a POST request with the event name in the URL and any other info as JSON in the body. The action is a notification in the IFTTT app. Once this is configured, you add the POST request to that URL to your Python script, to be made once it has scraped the relevant text.
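The webhook call can be sketched as follows, assuming the usual URL pattern of the IFTTT Webhooks service (event name and key in the path) and its value1 field in the JSON body. The event name and key are placeholders, and the helper names are made up for this example. As above, urllib stands in for requests so the sketch has no dependencies.

```python
import json
import urllib.request

# URL pattern of the IFTTT Webhooks service; event and key are filled in below.
IFTTT_URL = "https://maker.ifttt.com/trigger/{event}/with/key/{key}"


def build_ifttt_request(event: str, key: str, message: str) -> urllib.request.Request:
    """Prepare the POST request that fires the Applet's webhook trigger."""
    # value1 can be referenced as an ingredient in the notification text.
    body = json.dumps({"value1": message}).encode("utf-8")
    return urllib.request.Request(
        IFTTT_URL.format(event=event, key=key),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(event: str, key: str, message: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_ifttt_request(event, key, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call such as notify("page_changed", "YOUR_KEY", scraped_text) would then go at the end of the script, after the scraping step.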

The next step is to deploy this as a cloud function. My previous experience with Google Cloud Functions was not great. Specifically, the time to deploy, build and test was too long for my liking. I did some googling and found that AWS Lambda is a better choice in almost every way compared to Google Cloud Functions, according to this post. Note that I haven't validated that this info is correct, but I made the decision to go with AWS Lambda anyway. Because the script depends on the requests module, which does not ship natively with the Python environment in AWS Lambda, you have to install the dependency in the project root using pip install requests -t . and then create a zip archive of the folder from within the folder, i.e. upon extraction, lambda_function.py and the folders for the dependencies should be in the root.
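The point about zipping from within the folder is easy to get wrong, so here is a small sketch of the packaging step in Python. It assumes the dependencies have already been installed into the project folder with the pip command above; the function name and paths are made up for this example. The important detail is that archive names are relative to the project folder, so lambda_function.py ends up at the root of the zip rather than inside a parent directory.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(project_dir: str, zip_path: str) -> list[str]:
    """Zip the contents of project_dir so its files sit at the archive root."""
    root = Path(project_dir).resolve()
    target = Path(zip_path).resolve()
    names = []
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories and the archive itself if it lives in the project.
            if path.is_dir() or path == target:
                continue
            # Store paths relative to the project root, not the parent folder.
            arcname = path.relative_to(root).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names
```

Running zip -r ../function.zip . from inside the project folder should produce an equivalent archive if you prefer the command line.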

Note: I have blanked out the URL and Selector and other identifying info because I don't want you to know what I'm scraping.

Next, you configure AWS CloudWatch as a trigger for the AWS Lambda function with the frequency you want and test the function.
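The schedule can also be set up from code rather than through the console. The following is a minimal sketch, not what I actually ran: it assumes boto3 clients for CloudWatch Events (the "events" client) and Lambda are passed in, and the rule name and function ARN are placeholders. One detail worth knowing is that rate expressions use a singular unit when the value is 1, e.g. rate(1 hour) but rate(15 minutes).

```python
def rate_expression(value: int, unit: str) -> str:
    """Build a schedule expression such as 'rate(1 hour)' or 'rate(15 minutes)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour' or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def schedule_lambda(events_client, lambda_client, function_arn: str,
                    rule_name: str, expression: str) -> str:
    """Create a scheduled rule and point it at the Lambda function."""
    rule_arn = events_client.put_rule(
        Name=rule_name, ScheduleExpression=expression, State="ENABLED"
    )["RuleArn"]
    # Allow the rule to invoke the function.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    # Register the function as the rule's target.
    events_client.put_targets(
        Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}]
    )
    return rule_arn
```

With boto3 installed and credentials configured, the clients would come from boto3.client("events") and boto3.client("lambda"), and the expression from something like rate_expression(1, "hour").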

Hopefully this gives you an idea of how it works. This is a much easier setup than what I experienced with Google Cloud Functions in late 2018, though things may have changed by now. One problem I noticed was that the IFTTT notifications were taking about 3 hours to reach me. In my current use case there isn't much time criticality, so I can live with it. The delay may not lie primarily with IFTTT itself but with my network connection, the IFTTT app on my phone, or the notification settings in my IFTTT app. I haven't investigated those possibilities fully.

IFTTT, AWS Lambda Cloud Function and Web Scraper API to build a scrape and notify app