Interpreting OLS summaries in Python using pandas and statsmodels
I am currently making my way through a statistics course in Python and this is my cheat sheet when it comes to interpreting OLS results.
The screenshots below are model output from the statsmodels v0.13.0.dev0 (+34) library.
For complete project source code, see [Github project link]. There are other models in there but they aren’t detailed in this article yet.
For the following examples, I will be using wind turbine data from the USGS website. I will explore two different methods of inferential statistics: linear regression and logistic regression.
We will be using the Ordinary Least Squares (OLS) method for the linear regression models.
Also, if you aren’t familiar with what a typical wind turbine looks like, here is an image illustrating the relevant components which are mentioned in the analysis.
Not pictured: Turbine rated capacity, in kilowatts (kW), is the manufacturer’s stated output power at rated wind speed.
The Math behind the regression analysis
Your dataset probably has at least two variables that can be plotted against each other, forming a scatterplot. Usually you have a dependent or response variable, and an independent or explanatory variable. So you might be thinking about a variable you are trying to explain and another you think might influence the first variable.
In inferential statistics, the desired outcome is to identify if there is a relationship between two variables, and even further to see if one variable explains the other. The response variable is the one that you want to predict so the explanatory variable ‘explains’ the response.
Here is what you might typically see in a textbook.
If formulas confuse you a bit, here is the same graph explained in plain English.
If we were going to try to predict what the value would be on the y-axis, then the best-fit line in blue helps us with that. Right now we are just trying to understand the data, but maybe you can see how we could start predicting values of the response variable.
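For reference, the textbook form of a simple linear regression is usually written as below. The symbols are the conventional ones and are an assumption on my part, not copied from the figure: y is the response, x is the explanatory variable, β0 and β1 are the true intercept and slope, ε is the error term, and b0 and b1 are the estimates that define the best-fit line.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\qquad\text{and}\qquad
\hat{y}_i = b_0 + b_1 x_i
```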
Now that we got that out of the way, let’s look at our analysis.
Ordinary Least Squares (OLS)
In inferential statistics, Ordinary Least Squares is a method for fitting a simple linear model: it finds the best-fit line for a dataset. There are other types of models that have similar uses and output, but OLS is the most common and is usually introduced first.
It’s called least squares because the line is found by squaring the vertical distance between the line and each point, adding those squares up, and choosing the line that makes the total as small as possible. You can imagine that an algorithm can work through this much faster than we could by hand.
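To make the "smallest sum of squares" idea concrete, here is a minimal sketch using only NumPy. The five points are made up for illustration and are not the turbine data. It computes the best-fit slope and intercept with the standard closed-form formulas for one explanatory variable, then checks that nudging the line in either direction makes the sum of squares larger.

```python
import numpy as np

# Made-up points, for illustration only (not the turbine data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sum_of_squares(intercept, slope):
    """Sum of squared vertical distances between the line and each point."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


# Closed-form least-squares estimates for a single explanatory variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

best = sum_of_squares(intercept, slope)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, sum of squares={best:.4f}")

# Any other line gives a larger sum of squares
print(sum_of_squares(intercept, slope + 0.1) > best)
print(sum_of_squares(intercept + 0.5, slope) > best)
```

In practice you would let statsmodels do this for you; the point here is only to show what is being minimized.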
Let’s answer the first question.
1. Does rotor diameter explain total rated capacity of a wind turbine?
I know this question seems like the relationship would be obvious, but the obviousness can maybe help you remember how to interpret the results when there is a positive relationship.
First, let’s run the OLS model with a single variable for rotor diameter ‘t_rotor_d’ potentially explaining the rated capacity ‘t_cap’.
It will output the following which we can then interpret.
The above suggests the following:
- R-squared: 76.6% of the variability in rated capacity can be explained by the rotor diameter.
- Coef (t_rotor_d): For every 1 meter increase in rotor diameter, we can expect the rated capacity to increase by 24.87 kW. The coefficient is positive, so the relationship is positive.
- P>|t|: the p-value for rotor diameter is displayed as 0.000, meaning it rounds to zero at three decimal places. This suggests that rotor diameter is statistically significant in predicting the rated capacity.
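As a rough sketch of where those three numbers come from, the snippet below fits a single-variable OLS model with the statsmodels formula API and reads R-squared, the coefficient, and the p-value straight from the results object. The data here is synthetic and generated only for illustration (it is not the USGS dataset), so the values will not match the ones interpreted above. Only the column names 't_rotor_d' and 't_cap' are taken from this article.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the turbine data (NOT the USGS values)
rng = np.random.default_rng(0)
t_rotor_d = rng.uniform(40, 130, size=200)               # rotor diameter, m
t_cap = 25 * t_rotor_d - 400 + rng.normal(0, 250, 200)   # rated capacity, kW
df = pd.DataFrame({"t_rotor_d": t_rotor_d, "t_cap": t_cap})

# Response goes on the left of "~", the explanatory variable on the right
results = smf.ols("t_cap ~ t_rotor_d", data=df).fit()
print(results.summary())

# The three numbers discussed above, read from the results object
print("R-squared:", results.rsquared)
print("Coef (t_rotor_d):", results.params["t_rotor_d"])
print("P>|t|:", results.pvalues["t_rotor_d"])
```

Reading the values from `results` is handy when you want to compare several models without scanning each summary table by eye.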
If we look at the scatter plot itself, we can visually confirm the relationship between rated capacity in kW and rotor diameter.
We can clearly see that the rotor diameter (explanatory variable, x-axis) has a positive relationship with the rated capacity. So turbines with bigger rotors tend to have a higher rated capacity.
Glad we confirmed that common sense. Let’s look at two other variables that don’t have an obvious relationship in the next question.
2. Does hub height or rotor swept area explain the rated capacity?
Let’s run that same OLS model for a multivariable situation with quantitative data. You’ll notice that we use pretty much the same code as for a single-variable regression, except that we add more variables.
See! Multivariable linear regressions aren’t as scary as they sound.
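Here is a minimal sketch of what that looks like with the formula API: extra explanatory variables are simply joined with `+`. I am assuming a model with hub height and rotor swept area as the explanatory variables, and the data is again synthetic and for illustration only, so the numbers will differ from the real output. The last line shows that the condition number reported in the summary is also available as an attribute.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the turbine data (NOT the USGS values)
rng = np.random.default_rng(1)
n = 200
t_hub_height = rng.uniform(50, 110, size=n)   # hub height, m
t_rsa = rng.uniform(1500, 13000, size=n)      # rotor swept area, m^2
t_cap = 5 * t_hub_height + 0.2 * t_rsa + rng.normal(0, 200, n)
df = pd.DataFrame(
    {"t_cap": t_cap, "t_hub_height": t_hub_height, "t_rsa": t_rsa}
)

# Same call as the single-variable model, with one more term in the formula
multi = smf.ols("t_cap ~ t_hub_height + t_rsa", data=df).fit()
print(multi.summary())

# One coefficient per explanatory variable, plus the intercept
print(multi.params)

# "Cond. No." from the summary table, as a number
print("Condition number:", multi.condition_number)
```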
Here, the one thing that stands out to me is that the condition number is very large. statsmodels flags this in the notes under the summary, so you will see the warning in a Jupyter notebook as well. Even though neither coefficient for the two new variables has flipped to a negative sign, which is a common symptom, there may be some multicollinearity happening.
To investigate this further, we should look at the variance inflation factors (VIFs), which tell us how severe the multicollinearity is. The typical rule of thumb is that if they are less than 10, the multicollinearity isn’t something to be terribly concerned about.
And here is what the output should look like.
After looking at the VIFs, none are larger than 10, so swept area ‘t_rsa’ and hub height ‘t_hub_height’ are not correlated strongly enough to muddle the model’s ability to show whether either influences rated capacity.
Hope this helps!