Web-scraping a Paginated Online Wine Retailer’s Website (Part 1/5)
I’m in the midst of working on a project that scrapes, cleans, analyzes, and visualizes data from the online wine retailer wine.com. Specifically, I collected information from over 23,000 wines rated 94 points and above to see if I could answer some questions or glean any sort of insights. In this post, I focus on the web scraping part only.
I was interested in web scraping because I thought it was a stealthy way of obtaining a unique dataset to work with. For me it’s the closest thing I’ve ever done to ‘hacking’, if you think about it the way movies think about ‘hacking’. In other words, I think I’m being super cool. Plus I love wine and have some questions: which varietals tend to be fanciest, where are those wines from, do ratings have any sort of influence over prices, etc.
What is Web-Scraping?
So it turns out web-scraping is pretty easy. You definitely need to know some HTML though, and be familiar with how to read HTML in your browser’s Developer Tools. Web-scraping is essentially looking up strings in a tree of HTML tags. The HTML tags are what construct a web page, and you can investigate them through a series of parent, child, and sibling relationships. This will make more sense when I go over the structure of the web page I scraped and show you only the relevant part of the HTML structure.
Now let’s talk about tools. To actually get the data, I used several free tools and made many web searches to help me overcome challenges and obstacles. There is nothing ultra complicated in this process, and you probably won’t spend too much time learning how to web-scrape. The hardest part is probably finding a website that interests you.
To process the data, I used Python inside of a Jupyter Notebook. To store the data, I just used pandas to create a .CSV. However, I might eventually store the data in a database to make data retrieval easy in my other notebooks…
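As a rough sketch of that storage step (the column names here are placeholders, not my actual schema), the collected rows can be written out with pandas like this:

```python
import pandas as pd

# Hypothetical rows collected by the scraper; the real columns
# depend on which fields the scraper extracts.
rows = [
    {"name": "Example Cabernet 2018", "rating": 95, "price": 49.99},
    {"name": "Example Pinot Noir 2019", "rating": 94, "price": 39.99},
]

df = pd.DataFrame(rows)
df.to_csv("wine_data.csv", index=False)  # persist for later notebooks
```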
How to Scrape
My web-scraper of choice is Beautiful Soup (bs4). There is no particular reason why I picked this package besides the fact that it’s free and I remembered the name from somewhere. Beautiful Soup is a library that I imported in my Jupyter Notebook, along with the Requests library. Requests is super important because it’s what allows you to make an HTTP request to the web page directly from within the notebook. Beautiful Soup then parses the response from that request into an object that you can process to retrieve the golden nuggets of data you want to put in a pandas dataframe.
The code that does the request and creates the soup object is in my following main_execute() function:
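The original snippet was shared as an image, so here is a minimal reconstruction of what the request-and-soup part of a main_execute() like mine might contain. The wine.com list url is an assumption, and the pagination()/get_info() calls discussed later in the post are left out of this sketch:

```python
import requests
from bs4 import BeautifulSoup

def main_execute(url):
    """Request a page and parse it into a soup object.

    A minimal sketch: the full version also drives pagination()
    and get_info(), which are covered later in the post.
    """
    # Some retailers block requests with no browser-like user agent.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # stop early on a bad status code
    return BeautifulSoup(response.text, "html.parser")

if __name__ == "__main__":
    # Hypothetical list url; the real path on wine.com may differ.
    soup = main_execute("https://www.wine.com/list/wine/7155")
    print(soup.title)
```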
Now that we know how to request the data, let’s take a look at the web page itself. The webpage I scraped looks like this:
What you’ll want to do next is open your Developer Tools (on Chrome and Mac for example, hold the keys ‘option + command + i’ on the webpage you want). In the developer console, look for the Elements tab which will show you the webpage’s HTML. Look at the HTML version of the webpage you want to scrape, and take a few minutes to investigate how it’s structured. Then click on the elements you are interested in, and check out where they fall in the structure. Do you see any patterns? Is the data nested? What kinds of elements are the values in? Are there unique names that you could use?
After inspecting my webpage, I found a pretty simple structure. Here is a summary of the relevant web-page information:
What you want to look at above are the unique class names on each of the elements containing the data I wanted. Since they are unique, they are what I use to identify the elements in my scraper. There are other attributes besides classes that you can use too, so definitely check out the Beautiful Soup documentation.
I’ll cut to the chase: all the <li> elements in my webpage look the same. This is the repeatable pattern that I will iterate over!
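To make the pattern concrete, here is a stand-in for the kind of structure I found. The class names below are placeholders (the real names on wine.com differ), but the idea is the same: identical <li> elements you can collect in one call.

```python
from bs4 import BeautifulSoup

# A simplified stand-in for the page structure; "prodItem" and the
# inner class names are placeholders, not wine.com's real classes.
html = """
<ul class="prodList">
  <li class="prodItem">
    <span class="prodItemInfo_name">Example Cabernet 2018</span>
    <span class="averageRating_average">95</span>
    <span class="productPrice_price-reg">49.99</span>
  </li>
  <li class="prodItem">
    <span class="prodItemInfo_name">Example Pinot 2019</span>
    <span class="averageRating_average">94</span>
    <span class="productPrice_price-reg">39.99</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Because every wine lives in an identical <li>, one find_all()
# by class name grabs the whole list to iterate over.
items = soup.find_all("li", class_="prodItem")
print(len(items))  # prints 2, one per wine in the list
```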
Constructing the Scraper
In my main_execute() function, you might have noticed get_info(). This is the function that actually scrapes the data. You’ll recognize some of the class names from the example HTML image; these are the unique names I used to retrieve the string in each of the elements.
By putting this code in a function, I am able to call it every time I need to, which is once for every list element the wine data is stored in.
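A get_info() along these lines would do the per-element extraction. This is a sketch using the same placeholder class names as above, not my exact code:

```python
from bs4 import BeautifulSoup

def get_info(item):
    """Extract one wine's fields from a single <li> element.

    Placeholder class names; the real scraper uses the unique
    classes found via Developer Tools.
    """
    name = item.find("span", class_="prodItemInfo_name")
    rating = item.find("span", class_="averageRating_average")
    price = item.find("span", class_="productPrice_price-reg")
    # Guard against missing elements so one odd listing
    # doesn't crash the whole run.
    return {
        "name": name.get_text(strip=True) if name else None,
        "rating": rating.get_text(strip=True) if rating else None,
        "price": price.get_text(strip=True) if price else None,
    }

# Usage on one stand-in list element:
html = (
    '<li class="prodItem">'
    '<span class="prodItemInfo_name">Example Red</span>'
    '<span class="averageRating_average">96</span>'
    '<span class="productPrice_price-reg">59.99</span>'
    '</li>'
)
item = BeautifulSoup(html, "html.parser").li
print(get_info(item))
```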
Dealing with Pagination
One thing I noticed while inspecting the webpage is that only 25 list elements were displayed at a time; when I scrolled to the bottom, the next 25 elements appeared. And when they did, the original url was modified to add a number indicating how many of these new pages had been loaded. This is called pagination. Dun dun dun!
If you scroll back up to my main_execute() function, you’ll notice I call a function named pagination(). This function in turn calls a function named url_constructor().
This is what the pagination() and url_constructor() functions look like:
The pagination() function counts a variable named page upward and passes that number to url_constructor(), which takes the original url and inserts the number into it. pagination() does this until it gets an error indicating that there are no more valid urls; in other words, until there are no more list elements to display.
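Since the original functions were shown as images, here is a sketch of how the pair could work. The url scheme is an assumption, the scrape_page callable is a hypothetical stand-in for the request-plus-get_info() step, and I stop on an empty page rather than on an error, which is a simplification of my actual approach:

```python
def url_constructor(base_url, page):
    """Append the page number to the base list url.

    The "/<page>" scheme is an assumption; wine.com's real
    url pattern may differ.
    """
    return f"{base_url}/{page}" if page > 1 else base_url

def pagination(base_url, scrape_page, max_pages=1000):
    """Walk pages in ascending order until one yields no list items."""
    page = 1
    while page <= max_pages:
        url = url_constructor(base_url, page)
        items = scrape_page(url)  # e.g. request + get_info() per <li>
        if not items:             # no more wines to display: stop
            break
        yield from items
        page += 1

# Usage with a stand-in scraper that pretends there are 3 pages:
fake_pages = {1: ["a"], 2: ["b"], 3: ["c"]}

def fake_scrape(url):
    last = url.rsplit("/", 1)[-1]
    page = int(last) if last.isdigit() else 1
    return fake_pages.get(page, [])

results = list(pagination("https://example.com/list", fake_scrape))
print(results)  # prints ['a', 'b', 'c']
```

Because pagination() only moves on after scrape_page() returns, each page finishes before the next url is built, which is the synchronous behavior discussed next.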
The two functions are able to work together this way because Python here runs synchronously: when pagination() calls url_constructor(), the interpreter waits until the code in url_constructor() is done executing before it continues through the loop.
I ended up with something that looked like this:
Pretty cool, right? So there you have it: a web-scraper that executes in a single run and collects all of the data located on the wine.com webpage. In all, this code collected 23,822 rows of data in almost two hours. I had lunch, walked my dog, and came back to see everything had run.
To see what the data looks like and what I had to do to it, stay tuned for Part 2, where I go over the data cleaning process!
If you want to view the full codebase, you can view the GitHub repo here.
Thanks for reading and hope you found this article informative!