Leave No Trace
Concealment and Specification through Headers and Parameters
Welcome back to another edition of ScrapeWell! As always, if you're enjoying the reading (and learning), consider sharing with friends or colleagues - the more the merrier!
TLDR
The Upshot: Learn to add headers to your scraping project to conceal your scraper from bot detection. Use parameters to dynamically define URL content in GET requests.
Use Case: After this letter, you should be able to scrape more sophisticated websites and request specific website content. Want the sales at Whole Foods in Berkeley? Great, we’ve got it. Now Palo Alto? Got that too - and better yet, all with the same code.
To the code!
At this point you’ve probably come across one of these situations and thrown your hands up in disgust or annoyance:
403 Status Code :(
“Recaptcha Required” :(
“Bot Detected” :(
All three are the bane of my existence, and I wish with my whole heart that they could be banished. A bit dramatic, but you catch my drift - it’s annoying. So what can we do about it? Well, let's take a step back and talk, at a high level, about what we actually do when scraping: we essentially mimic a person browsing a website and “reading” the information.
When these programs detect us, they’ve determined that we are not human (and are therefore not allowed). So what should we do?
Answer: we need to make the scraper more “human-like”. Now, there are many layers to this, with varying levels of complexity, but hey, that’s why we’re here - to learn.
So let’s start with the most basic fix: using headers.
Analogy time!! Headers are “attributes” of your browser, sent along with every request, that tell the receiving computer/server who you are. So, by adding headers to our requests, we can effectively “introduce” ourselves to the server - something a bare-bones bot typically wouldn’t do.
From this point forward, do NOT just copy and paste my code, as my headers WILL BE DIFFERENT from yours (I use a Mac, you may use a PC, etc.). We will be using a website to find our headers, and you should fill in yours wherever they are used - this is crucial.
Why is it crucial?? Well, think about it: if you’re using a PC and you set your “introduction” to say “Hey, I’m an Apple user!”, the server is going to be confused and suspicious - and hence more likely to block you.
So let’s find our headers!
The Holy Headers
The quickest way to find your headers is to go to a website like this that displays them for you. I will go ahead and start a new script called aldi_us_scrape.py and place it in the models folder from the previous letter, as I would like to display it in my Flask application - so it looks like this:
As usual, we have the base imports and URL, but now look at the definition of the “headers”. The header website is linked here again so you can replace mine with yours. Note: keep only the headers we have here and ignore the other items, as they are not needed for this request.
import requests
import bs4
import pandas as pd
from bs4 import BeautifulSoup
'''
These are our headers - they are mine; yours are likely to look different. Notice the format of {'key': 'value', ...} - yours should also be formatted as a dictionary like this.
'''
headers = { 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate',
'accept-language': 'en-US',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
page_url = "https://www.aldi.us/en/weekly-specials/this-weeks-aldi-finds/"
'''
Look where we have "passed" the headers - they are an argument to the requests.get() function. This essentially tells requests how to introduce ourselves to the server.
'''
page_sourced = requests.get(page_url, headers = headers)
print(page_sourced.status_code)
'''
Output: 200 -- YAY!!! It worked!
'''
Great, so yours might look a bit different (the headers), but the format - and output - should be the same! So far, all we have done differently is add the headers dictionary to the request.
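As an optional aside - a minimal sketch, assuming httpbin.org (an independent echo service, not part of this project) is reachable from your machine - you can send your headers to an endpoint that echoes back exactly what it received, which is a handy way to confirm your dictionary is being passed correctly:
import requests
# Reuse the same headers dictionary as above (shortened here for space)
headers = {
'accept-language': 'en-US',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# httpbin.org/headers echoes back the headers it received, as JSON
echo = requests.get("https://httpbin.org/headers", headers = headers)
print(echo.json())
With that optional check out of the way, we can fill out the rest of the scraper as follows: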
import requests
import bs4
import pandas as pd
from bs4 import BeautifulSoup
headers = { 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate',
'accept-language': 'en-US',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
page_url = "https://www.aldi.us/en/weekly-specials/this-weeks-aldi-finds/"
page_sourced = requests.get(page_url, headers = headers).content
html_content = BeautifulSoup(page_sourced, "html.parser")
# Grab the name of every sale item on the page
names_items = [i.text.strip() for i in html_content.find_all('div', class_="box--description--header")]
# Flag the items whose price is only available in store
in_store = [i.text.strip()=='*see price in store' for i in html_content.find_all('span', class_="box--amount")]
# Keep only the items that actually list a price online
min_names = []
for i in range(len(names_items)):
    if not in_store[i]:
        min_names.append(names_items[i])
    else:
        continue
# Dollars and cents live in separate spans, so stitch them back together
dollar_price = [i.text.strip() for i in html_content.find_all('span', class_="box--value")]
cent_price = [i.text.strip() for i in html_content.find_all('span', class_="box--decimal")]
price_now = [dollar_price[i]+cent_price[i] for i in range(len(cent_price))]
df = pd.DataFrame({'Sale Item': min_names, 'Sale Price':price_now})
print(df)
'''
Output:
Sale Item Sale Price
0 Crofton 6-Piece Glass Bowl Set with Snapping Lids $8.99
1 First Alert Waterproof Fire Safe $24.99
2 Crofton 30-Piece Assorted Food Storage $4.99
3 Crofton Cut Fruit Bowl $8.99 '''
For another view of this code, check out the GitHub link here. The .strip() function is useful here, as it removes any spaces/new lines from the text. We also use list comprehensions to keep things compact, but we apologize if this is not the easiest to read on mobile - we will see if there is a better way in the future, but for now, you have the GitHub link!
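If list comprehensions are new to you, here is a tiny standalone sketch (the raw_text list below is made up for illustration, not data from the site) showing that each comprehension is just a compact for loop, and what .strip() actually removes:
# A made-up example list - this is not scraped data
raw_text = ["  Crofton Cut Fruit Bowl\n", " First Alert Waterproof Fire Safe  "]
# .strip() removes leading/trailing spaces and newlines
cleaned = [t.strip() for t in raw_text]
# The comprehension above is equivalent to this explicit loop
cleaned_loop = []
for t in raw_text:
    cleaned_loop.append(t.strip())
print(cleaned)                   # ['Crofton Cut Fruit Bowl', 'First Alert Waterproof Fire Safe']
print(cleaned == cleaned_loop)   # True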
Awesome - so now we know how to use headers: you pass them into the request like the above. Now let's move on to something a bit different: inputting data into our request.
Helpful Parameters
It turns out we can pass more information to the server than just who we are - these additions are called parameters. Parameters also happen to be a more natural way for a human to arrive at a page: you don’t type out the exact URL with all of its query values by hand, you’re typically directed there by the site itself - which is why we use parameters! Typical use cases include indicating your zip code or store ID number, adding a search term or filter, or adding a page number to your request. How do we use them? Well, let’s see!
Say we want to scrape Whole Foods again - like our first foray with BeautifulSoup - but this time we have reason to believe that not all stores have the same sales. How would we use the knowledge of store IDs to get the sales from specific stores? This is how:
import requests
import bs4
from bs4 import BeautifulSoup
page_url = "https://www.wholefoodsmarket.com/sales-flyer"
params = {'store-id':'10005'} #here we set the store ID
'''
For right now, do not worry about exactly how to find the parameter name - in general you will see it in a URL like the following:
--------------- Important ------------------------
https://www.wholefoodsmarket.com/sales-flyer?store-id=10005
is the same as
params = {'store-id': '10005'}
requests.get("https://www.wholefoodsmarket.com/sales-flyer", params = params)
-------------------------------------------------
'''
page_sourced = requests.get(page_url, params = params)
print(page_sourced.status_code)
'''
Output: 200 - Always check the response first
'''
So we got a 200 - awesome - so let's do the next part and get the response content so we can be sure it’s right! Note: I have not used headers here just to keep the code cleaner, but we can absolutely use them by adding another argument to requests.get() (e.g. requests.get(URL, headers = headers, params = params)).
import requests
import bs4
import pandas as pd
from bs4 import BeautifulSoup
page_url = "https://www.wholefoodsmarket.com/sales-flyer"
params = {'store-id':'10005'}
page_sourced = requests.get(page_url, params = params).content
html_content = BeautifulSoup(page_sourced, "html.parser")
# Each sale item's name lives in an h4 tag with this class
sale_items = html_content.findAll('h4', class_="w-sales-tile__product")
sale_item_titles = [i.text for i in sale_items]
# The sale price lives in a span tag with these classes
sale_item_price = html_content.findAll('span', class_="w-sales-tile__sale-price w-header3 w-bold-txt")
sale_item_prices = [i.text for i in sale_item_price]
df = pd.DataFrame({'Sale Item': sale_item_titles, 'Sale Price':sale_item_prices})
print(df)
'''
Output:
        Sale Item                               Sale Price
0       Red Cherries*                           $4.99/lb
1       Raspberries*                            $3.50 ea
2       All Beyond Meat Products*               25% off
3       Fresh Halibut Fillets*                  $21.99/lb
4       Chicken Breast Kabobs or Skewers*       $5.99/lb
5       No Shells Pistachios*                   $1 off
6       Sparkling Water, 12 pk*                 2 for $8.00
7       Plain or Marinated Beef Skirt Steaks*   $8.99/lb
8       Select Kite Hill Plant-Based Products*  25% off
9       Strawberries*                           $3.33 ea
10      Black, Red or Green Grapes*             $2.99/lb '''
Note: a random number will not necessarily work as a store ID - you have to find them. Lucky for us, I wrote a little bit of code to find them all so you can test this. The store locations and corresponding URLs are here; check each URL for the store ID (you can find it after the ‘?’ in the URL).
More generally, if you inspect links you will find the following format: https://domain.com/page?something=a&something_else=b, and so on. Each of these something=value pairs is just a parameter, and we can use them to our advantage! It's a cheeky way of getting the page we want faster, without having to find the exact URL, and it even makes the scraper seem more human-like.
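If you have a list of those store URLs (from the file linked above, or any URL copied from your browser), here is a minimal sketch for pulling the store ID out of the query string - assuming the parameter is named 'store-id', as in the examples above:
from urllib.parse import urlparse, parse_qs
# A URL copied from the browser - swap in one from the store list
url = "https://www.wholefoodsmarket.com/sales-flyer?store-id=10005"
# parse_qs turns everything after the '?' into a dictionary of parameters
query = parse_qs(urlparse(url).query)
store_id = query.get('store-id', [None])[0]
print(store_id)  # 10005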
Recap
So we have covered it all - I’m joking, spare me - but seriously, you’re already well on your way to being able to scrape a lot of websites. I want to conclude with some best-practice advice: always add some time delays to your code. Why? So you don’t overload the server. This could be in the form of random time delays or specific wait times before making requests. This is crucial, as it is one way to avoid being caught by bot detection. Why? Well, uh, you’re not the Flash, even if you wish you were, and, let’s face it - you can’t browse at computer speeds. Then again, I don’t know, maybe you can. Another reason is to practice scraping etiquette - we don’t want to bring the site down, so let’s not overload the server.
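A minimal sketch of what that could look like - the 2 to 5 second range here is arbitrary, so pick whatever feels polite for the site you’re scraping:
import random
import time
import requests
urls = [
    "https://www.aldi.us/en/weekly-specials/this-weeks-aldi-finds/",
    "https://www.wholefoodsmarket.com/sales-flyer",
]
for url in urls:
    # Wait a random 2-5 seconds before each request so we browse at "human" speed
    time.sleep(random.uniform(2, 5))
    response = requests.get(url)
    print(url, response.status_code)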
And finally, something to keep in mind - parameters do not have to be statically defined in your program. You can allow a user to input some data that flows to the parameter - e.g. I live in Berkeley, hence my Whole Foods store ID is defined by my location to be the one for Berkeley (try implementing this logic with the csv provided here; a rough sketch follows below). Why do this? It allows users the freedom to scrape dynamically, with variable inputs, in real time. The subject of this week’s ByteScrape will be related to this topic - more specifically, building Flask applications to pass variables through to a scraper. We will get back to this another day, but it’s just a heads-up on what is to come to keep you excited!
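As a rough sketch of that idea - the file name and column names below are made up for illustration, so swap in the real CSV linked above and its actual columns:
import pandas as pd
import requests
# Hypothetical CSV with 'city' and 'store_id' columns - use the real file linked above
stores = pd.read_csv("whole_foods_stores.csv")
# Let the user (or, later on, a Flask route) supply the location at run time
city = input("Which city are you in? ").strip().title()
match = stores.loc[stores['city'] == city, 'store_id']
if match.empty:
    print(f"No store found for {city}")
else:
    params = {'store-id': str(match.iloc[0])}
    page = requests.get("https://www.wholefoodsmarket.com/sales-flyer", params = params)
    print(page.status_code)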
As always, if you’ve made it this far, thank you for reading and I hope you enjoyed it. Consider following ScrapeWell on Twitter here if you want more ad hoc tips on scraping and general project programming! Also, feel free to send me a Discord message if something was inadequately covered - and happy scraping!!
Peace,
bs354