Introduction
While analyzing the relationship between two variables is useful, the real power of statistics lies in taking our analysis of the association and using it to predict the values of variables. In this topic we will go over linear regression models and how we can predict the $y$-value for a given $x$-value based on a linear association between the two variables.
Linear Regression Models
If we have two quantitative variables with an approximately linear association, we can create a linear regression model to represent their relationship (non-linear associations can still have strong correlations; hence, linear regression models of non-linear associations can work, but are flawed). In a linear regression model, we are predicting the response variable, $y$, with the explanatory variable, $x$. The predicted value in the model is notated as $\hat{y}$, or “y hat”. The linear regression model is written as $\hat{y} = a + bx$, where $a$ is the intercept and $b$ is the slope. The methods for getting the values of $a$ and $b$ can vary based on the type of linear regression model (the most common regression model, least squares, will be explored in topic 2.8), but the values will always be provided for you on the AP exam if calculations with them are necessary.
Graphing the linear regression model on top of the scatterplot for the data set will give a line that follows the trend of the data, often called the “best fit line”.
If you want to predict the $y$-value for a certain $x$-value, all you need to do is plug the $x$-value into your linear regression model and calculate. For example, let’s say a certain linear regression model between height (cm) and weight (kg) is $\widehat{\text{weight}} = a + b(\text{height})$ (when you write a linear regression model in the context of a data set, you must replace $x$ and $y$ with the names of the explanatory and response variables, and keep the hat on the response variable to convey that it is only the predicted value). To predict the weight of someone who is $h$ cm tall with this model, you first plug the height into the linear regression model, $\widehat{\text{weight}} = a + b(h)$, then calculate to get the predicted weight in kg.
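As a minimal sketch of the plug-in step, the short Python example below evaluates a linear regression model at one height. The function name `predict` and the intercept and slope values are hypothetical, chosen only so the arithmetic is easy to follow; they are not the coefficients of any model from this guide.

```python
def predict(intercept: float, slope: float, x: float) -> float:
    """Return the predicted response, y-hat = intercept + slope * x."""
    return intercept + slope * x


# Hypothetical coefficients for illustration only (not taken from the text).
INTERCEPT_KG = -100.0      # a: predicted weight (kg) at a height of 0 cm
SLOPE_KG_PER_CM = 1.0      # b: change in predicted weight per extra cm

height_cm = 170.0
predicted_weight = predict(INTERCEPT_KG, SLOPE_KG_PER_CM, height_cm)
print(f"Predicted weight: {predicted_weight:.1f} kg")  # Predicted weight: 70.0 kg
```

With these made-up values, the model reads as predicted weight = -100 + 1.0(height), so a height of 170 cm gives a predicted weight of 70 kg. On the AP exam the intercept and slope would be given to you, and the calculation is the same.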
Interpolation vs. Extrapolation
When calculating a predicted value with a linear regression model, not every prediction comes with the same amount of certainty. If you are predicting the matcha ice cream sales of a certain shop based on daily high temperature with a linear regression model, the predicted sales for an average temperature will be quite accurate, but predicted sales at a temperature higher than any ever recorded will have less certainty, as there’s no evidence as to whether or not the trend continues. Predicting the response variable with an $x$-value inside the range of observed values of the explanatory variable is called interpolation, while predicting with an $x$-value outside of that range is called extrapolation.
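The distinction can be checked mechanically by comparing the new $x$-value with the smallest and largest observed values of the explanatory variable. The sketch below assumes a hypothetical helper named `classify_prediction` and made-up temperature data; it only labels the prediction and says nothing about how accurate it will be.

```python
def classify_prediction(observed_x: list[float], new_x: float) -> str:
    """Label a prediction by whether new_x lies inside the observed x range."""
    lowest, highest = min(observed_x), max(observed_x)
    if lowest <= new_x <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical daily high temperatures (degrees C) in the data set.
observed_temps = [18.0, 21.5, 24.0, 27.5, 30.0]

print(classify_prediction(observed_temps, 25.0))  # interpolation
print(classify_prediction(observed_temps, 38.0))  # extrapolation
```

A temperature of 25 falls between the lowest and highest recorded values, so a prediction there is interpolation; 38 is hotter than anything recorded, so a prediction there is extrapolation and should be treated with more caution.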
A notable example of extrapolation is when your explanatory variable is time. Let’s say you have a linear regression model of a data set where year is the explanatory variable and population is the response variable. Any prediction of the population in the future would be extrapolation, as you cannot be certain that the population will continue to grow the same way. The same is true of predictions for years before the first year of your data, since you cannot be sure how the population grew back then. Predicting a value within the span of your data, for example when you are missing the population for a certain year, is interpolation and will have greater certainty.
