2.7 DATA = MODEL + ERROR: Notation
Up to now we have represented models using word equations and R code. Now let’s go one step further to see how we can use mathematical notation to represent the simple (empty) model.
R code computes a model fit (e.g.,
We have introduced the overarching concept that DATA = MODEL + ERROR. In our simple model, we are using a basic mathematical function, the mean, to model the distribution of home prices. The function generates the same predicted price for every home in the distribution.
We could represent this model in a word equation like this:
PriceK = Mean + Error
Although math notation sometimes feels like it’s just here to make your life harder, there are some real advantages to rewriting this statement in mathematical notation. Here’s one form this notation might take:
$Y_i = b_0 + e_i$
This equation comes from a statistical tradition called the General Linear Model (GLM). GLM equations appear in published scientific articles and are common in economics, biology, public health, and other fields. We will use GLM notation throughout this book to help us represent and think about statistical models.
For the empty model of PriceK, $Y_i$ represents each home’s price, $b_0$ represents the mean, and $e_i$ represents each home’s error from the mean.
For now, with the empty model, $b_0$ represents the mean of PriceK (181.4). But for other models, and other situations, it can represent other values. Indeed, this flexibility is what makes the General Linear Model general. This will come in handy later when we make more complicated models.
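To make the link between the word equation and R concrete, here is a minimal sketch of fitting an empty model in base R. The five prices are made-up illustration values, not the real Ames data, and the variable names (`price_k`, `model_prediction`, `error`) are hypothetical. The formula `price_k ~ 1` asks `lm()` for an intercept-only model, which is one common way to write the empty model.

```r
# Hypothetical sale prices in thousands of dollars (NOT the real Ames data)
price_k <- c(260, 150, 181, 120, 210)

# Fit the empty model: a linear model with only an intercept
empty_model <- lm(price_k ~ 1)

# The single number the empty model estimates
model_prediction <- unname(coef(empty_model)[1])

# The leftover error for each home
error <- unname(resid(empty_model))

# The estimate is the mean, and each price is exactly prediction + error
all.equal(model_prediction, mean(price_k))
all.equal(price_k, model_prediction + error)
```

Both checks should come back `TRUE`: the empty model's one estimate is the mean, and adding each home's error back to that estimate recovers the original price.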
Does DATA Really Equal MODEL + ERROR?
The GLM notation ($Y_i = b_0 + e_i$) is another way of writing DATA = MODEL + ERROR.
Let’s take a look at a single home in Ames that sold for $260K. In the plot below we have colored the dot representing this home black.
Using our simple model, we can decompose the price of the house we colored black into two parts, model and error. 260, therefore, can be expressed as 181.4 (the model prediction) + error. Put another way:
$260 = 181.4 + \text{error}$
Subtracting the model prediction from the data point will give us the error or residual (260 - 181.4 = 78.6). In this case, the residual is positive because this particular home’s price is higher than the price predicted by the empty model.
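The same arithmetic can be checked in R. This is a small sketch that uses only the two numbers given in the text (the $260K sale price and the 181.4 mean); the variable names are hypothetical and chosen for illustration.

```r
# Values taken from the text, in thousands of dollars
price <- 260               # the home colored black in the plot
model_prediction <- 181.4  # the empty model's prediction (the mean of PriceK)

# ERROR = DATA - MODEL
residual <- price - model_prediction
residual

# Adding the error back to the model prediction recovers the data
all.equal(model_prediction + residual, price)

# A positive residual means the home sold above the model's prediction
residual > 0
```

A home that sold for less than 181.4 would produce a negative residual from the same subtraction.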
As our models become more complex they still will generate a predicted value of