Hard scrubbing required

In the last post I showed how I created a web scraper that retrieves data from wine.com’s best-rated wine list. The scraper produced a dataset with 23,822 rows of wine data: wine name, vintage year, origin, price, average rating, and total number of ratings. Refer here to see how I did this.

This post focuses on cleaning the data produced by the web scraper. The data looks like this:

[Screenshot: pandas dataframe constructed by the web scraper]

The cleaning process was probably the most challenging part of this project, so I will go into detail on the main quality issues I identified. Two of those issues involved columns that contained more than one piece of relevant data, and the other was a column that contained strings that weren’t actually descriptions of the variable in question.

Extracting relevant digits out of a string

I wanted to separate the year from ‘product_name’ and create a separate ‘year’ column. Part of the challenge was that some of the product names had other digits designating things like bottle sizes, so a naive digit extraction would pick up values besides the year, which was the only value I actually wanted. I noticed that the non-year digits tended to be in parentheses, so it was easy to just remove the parentheses and their contents.

After that hurdle, extracting the digits was pretty easy, and it yielded only about 800 false positives from what I could tell. Of the several approaches I tried, this was the one that resulted in the fewest non-years being extracted: roughly 800 entries were probably not years (anything under 1000, because that would be really old wine!).

Here is the code that did all of this.
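The original gist doesn’t survive in this copy, so here is a minimal sketch of the approach. The sample rows and the exact regular expressions are my assumptions, not the original code:

```python
import pandas as pd

# Hypothetical sample rows mirroring the scraped data
df = pd.DataFrame({
    "product_name": [
        "Caymus Cabernet Sauvignon 2018",
        "Opus One 2015 (1.5 Liter Magnum)",
        "Veuve Clicquot Brut Champagne",
    ]
})

# Drop any parenthesized text, e.g. bottle sizes like "(1.5 Liter Magnum)"
df["product_name"] = (
    df["product_name"].str.replace(r"\(.*?\)", "", regex=True).str.strip()
)

# Extract the first run of digits as the candidate year (NaN if none found)
df["year"] = df["product_name"].str.extract(r"(\d+)", expand=False)

# Convert to numeric so implausible values (e.g. under 1000) can be flagged later
df["year"] = pd.to_numeric(df["year"], errors="coerce")
```

Removing the parenthesized text first is what keeps bottle-size numbers out of the ‘year’ column.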

The next judgment calls will come during the analysis portion, since some of the older years (like anything prior to 1950) are probably anniversary bottles or something else not indicating the actual vintage year. As far as data cleaning goes, though, this was good enough.

Separating two relevant strings in uneven lists

The second challenge came when I wanted to separate ‘origin’ into two variables: ‘appellation’ and world ‘region’. The ‘origin’ column had values that looked like uneven lists.

Some wines appear as Rutherford, Napa Valley, California, while others are just Napa Valley, California. For the sake of consistency, I chose to drop the specific town like Rutherford and keep only what I thought of as the ‘appellation’ and the world ‘region’. The ‘appellation’ was going to be specific enough for my purposes that I didn’t need to know exactly where the wine was produced.

To do this, I created a new dataframe with the result of splitting the ‘origin’ column into two, and further cleaned the column that happened to have two values. Once I had this dataframe, I simply merged it with the main one and dropped the ‘origin’ column.

Here is the code:
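The embedded gist isn’t reproduced here, but the split-and-merge step might be sketched like this, assuming ‘origin’ values are comma-separated with the world region last (the sample rows are made up):

```python
import pandas as pd

# Hypothetical examples of the 'origin' column
df = pd.DataFrame({
    "product_name": ["Wine A", "Wine B"],
    "origin": ["Rutherford, Napa Valley, California", "Napa Valley, California"],
})

# Split on commas; the last element is the world region and the
# second-to-last is the appellation, which drops towns like Rutherford
parts = df["origin"].str.split(", ")
origin_df = pd.DataFrame(
    {"appellation": parts.str[-2], "region": parts.str[-1]},
    index=df.index,
)

# Merge the new columns back in and drop the original 'origin' column
df = df.join(origin_df).drop(columns="origin")
```

Indexing from the end of the list handles the “uneven lists” problem: it works whether the origin has two parts or three.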

What this yielded was 128 unique appellations in 16 different world regions. That’s when I began to learn about the different wine regions of the world: Argentina, Australia, Austria, California, Chile, France, Germany, Greece, Italy, New Zealand, Oregon, Portugal, South Africa, Spain, Washington, and Other U.S.

Removing rows that contain certain values

I wanted to get the varietals cleaned up because that data was going to be important in answering my questions. After reviewing the list of unique varietals, I noticed that there were non-varietal designations in that column. Things like ‘Port’, ‘Non-Vintage Sparkling Wine’, and ‘Vintage Sparkling Wine’ don’t describe bona fide varietals to me.

Here is the simple code to eliminate the rows and create a new ‘df_varietals’ dataframe.
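A sketch of that filter follows. The ‘varietal’ column name, the sample rows, and the exact list of labels to drop are my assumptions; the real list would come from reviewing the scraped data:

```python
import pandas as pd

# Hypothetical sample of the scraped data
df = pd.DataFrame({
    "product_name": ["Wine A", "Wine B", "Wine C"],
    "varietal": ["Cabernet Sauvignon", "Port", "Non-Vintage Sparkling Wine"],
})

# Labels that appear in the varietal column but aren't actual grape varietals
non_varietals = ["Port", "Non-Vintage Sparkling Wine", "Vintage Sparkling Wine"]

# Keep only rows with bona fide varietals in a new dataframe
df_varietals = df[~df["varietal"].isin(non_varietals)].copy()
```

Using `.copy()` avoids pandas’ SettingWithCopyWarning when the new dataframe is modified later.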

The rows with non-varietals represented only a small sample of 514 rows, so I felt it was OK to leave them out of some of the analysis. However, since there were going to be analyses where varietals weren’t evaluated at all, it was easier to create a new dataframe without those entries, to be used only when looking at the varietal data.

Parting Thoughts

There were some other data cleanliness issues along the way, like missing values or values stored as the wrong datatype. I converted those to make sure that all the dataframes were ready for manipulation in the assess and analyze portions of the project.
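As one hypothetical example of those conversions (the column names and string formats here are assumed, not taken from the original code), prices and ratings scraped as strings can be coerced to numeric types:

```python
import pandas as pd

# Toy rows with values stored as strings, as a scraper often produces them
df = pd.DataFrame({
    "price": ["$49.99", "$120.00", None],
    "avg_rating": ["4.5", "4.8", "4.2"],
})

# Strip the currency symbol and convert to float; missing values stay NaN
df["price"] = df["price"].str.lstrip("$").astype(float)

# Coerce ratings to numeric, turning any unparseable strings into NaN
df["avg_rating"] = pd.to_numeric(df["avg_rating"], errors="coerce")
```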

One value that I was unable to separate out was the winery name. Unfortunately, wine.com constructed the product name by concatenating the wine name, winery, and year all together, making it really hard to distinguish the wine name from the winery name. Although this is unfortunate, it will not prevent my analysis from answering the questions that I have, as long as the product names are unique.

I ended up storing all the dataframes in a SQLite database via SQLAlchemy for easy retrieval later on. I was worried that if I just stored them in a CSV, I might lose or add weird formatting in the process.
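A minimal sketch of that storage step, assuming a SQLite file and a table name of ‘wines’ (both are my guesses, and the dataframe here is a toy stand-in):

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy stand-in for the cleaned dataframe
df = pd.DataFrame({"product_name": ["Wine A"], "year": [2018], "price": [49.99]})

# A SQLite file preserves dtypes, unlike round-tripping through CSV
engine = create_engine("sqlite:///wine_data.db")
df.to_sql("wines", engine, if_exists="replace", index=False)

# Later retrieval for the assess and analyze steps
wines = pd.read_sql("SELECT * FROM wines", engine)
```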

To read about some of my preliminary findings with this dataset, stay tuned for Part 3!

If you want to view the full codebase, you can view the GitHub repo here.

Thanks for reading and hope you found this article informative!
