2.8 Parameters and Estimates
Now that we have spent some time on the details of the empty model and how it is constructed from a sample of data, it is important to pause and put this data-based model into context.
Although we construct models using the data we have, our ultimate goal is to model the population from which the data came, and the Data Generating Process that produced that population over time. If a model is a good model of our data, but not of the DGP, it won’t be so useful, either for understanding the DGP or predicting future events.
Why Wouldn’t Our Model of the Data Accurately Reflect the DGP?
We are encouraging you to take seriously the distinction between the model you create based on data, and the true model of the Data Generating Process. You might wonder: but why would my model be different from the true DGP?
There are two main reasons, both having to do with sampling, and both exacerbated if your sample is relatively small. One reason is that your sample may be biased. For example, in Ames, the data are from homes in two neighborhoods (College Creek and Old Town). It’s possible that these neighborhoods are very different from the other neighborhoods in Ames; thus their prices would mislead us if we were trying to estimate the average home price for all of Ames.
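To see how a biased sample can mislead, here is a minimal sketch using simulated data rather than the real Ames data. The data frame `town`, the neighborhood labels "A" through "D", and their typical prices are all made up for illustration.

```r
# Simulated (hypothetical) town: four neighborhoods with different typical prices
set.seed(10)
town <- data.frame(
  Neighborhood = rep(c("A", "B", "C", "D"), each = 500),
  PriceK = c(rnorm(500, mean = 120, sd = 25),
             rnorm(500, mean = 160, sd = 25),
             rnorm(500, mean = 200, sd = 25),
             rnorm(500, mean = 260, sd = 25))
)

# Mean price across the whole simulated population
population_mean <- mean(town$PriceK)

# A biased sample: homes from only two of the four neighborhoods
two_hoods <- town[town$Neighborhood %in% c("A", "B"), ]
biased_mean <- mean(two_hoods$PriceK)

population_mean
biased_mean
```

Because the two sampled neighborhoods were simulated with lower prices than the others, the mean of the biased sample falls well below the mean of the whole simulated town, no matter how many homes from those two neighborhoods we include.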
The other reason is sampling variation. Even if your sample is not biased, there still is random variation from sample to sample. Because you typically only have one sample in your data set, it’s important to consider that it may, just by randomness alone, be off from what the true DGP is.
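Sampling variation can be sketched with a short simulation. The population below is hypothetical (10,000 simulated prices, in thousands of dollars), not the real Ames data; the sample size of 185 simply mirrors the size of our sample.

```r
set.seed(20)

# Hypothetical population of 10,000 home prices (in thousands of dollars)
population <- rnorm(10000, mean = 180, sd = 40)

# Draw five unbiased random samples of 185 homes and compute each sample's mean
sample_means <- replicate(5, mean(sample(population, size = 185)))

mean(population)  # the population mean, known here only because we simulated it
sample_means      # five sample means, each a little different
```

Each sample is drawn at random from the same population, yet the five means differ from one another and from the population mean. With real data we usually see only one of these samples.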
Distinguishing Parameters from their Estimates
The best prediction for a future home’s sale price in Ames would be the mean of the population, not the mean price of our small sample of 185 homes. But because we don’t know the true mean of the population, we must estimate it using the mean from our data.
For this reason, it is important to distinguish between a parameter, which is a number that summarizes something about a population or DGP, and its estimate, which is our best guess, based on data, of what the true value of the parameter is.
The empty model is sometimes called a one-parameter model because we are estimating only one parameter: the mean. Because the true mean of the population or DGP is unknown and can’t be calculated, we must estimate it. But we should not assume that our estimate is correct; the true parameter value will probably be different!
We already have introduced notation to represent the mean we calculate from data (\(b_0\)). Now let’s learn the notation we use for the mean of the DGP, which we are trying to estimate.
Model of DGP | Parameter Estimates Calculated from Sample Data |
---|---|
\(Y_{i}=\beta_{0}+\epsilon_{i}\) | \(Y_i = b_0 + e_i\) |
The equation on the left has the parameter we want to estimate (\(\beta_{0}\), pronounced “beta-sub-0”); the one on the right substitutes the estimate (\(b_0\)) for the parameter. \(\beta_0\) is the mean of the population; \(b_0\) is the estimate of that mean.
To help us remember that the true mean of the DGP is unknown and not the same as \(b_0\), we use different notation to refer to the parameter we are estimating: \(\beta_0\). In general, statisticians use Greek letters (e.g., \(\beta\)) to represent parameters, and Roman letters (e.g., \(b\)) to represent the estimates.
When we ran this R code to fit the empty model to our Ames data, here is what we got.
empty_model <- lm(PriceK ~ NULL, data = Ames)
empty_model
Call:
lm(formula = PriceK ~ NULL, data = Ames)
Coefficients:
(Intercept)
181.4
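The number reported under (Intercept) is the estimate \(b_0\), which for the empty model is the sample mean. A minimal sketch of that check follows, using a simulated data frame (the name `homes` and its values are assumptions, standing in for the real Ames data).

```r
set.seed(30)

# Hypothetical sample of 185 home prices (in thousands of dollars)
homes <- data.frame(PriceK = rnorm(185, mean = 180, sd = 40))

# Fit the empty model, using the same formula as in the text
empty_model <- lm(PriceK ~ NULL, data = homes)

# Pull out the estimate b_0 and compare it with the sample mean
b0 <- unname(coef(empty_model)["(Intercept)"])
b0
mean(homes$PriceK)
all.equal(b0, mean(homes$PriceK))
```

The same comparison could be run on the real data by replacing `homes` with `Ames`, assuming the Ames data frame is loaded.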
A “good” estimate?
The sample statistic (\(b_0\)) is probably not going to be the exact value of the population parameter (\(\beta_0\)). But it is the best estimate we can come up with based on the data we have. It is also an unbiased estimator: it does not systematically run too high or too low, so across many random samples the overestimates and underestimates balance out.
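One way to get a feel for what unbiased means is to simulate a population where \(\beta_0\) is known and then estimate it many times. This is only a sketch: the population, the number of samples (2,000), and the names `beta0` and `b0s` are assumptions chosen for illustration.

```r
set.seed(40)

# Hypothetical population; beta0 is known only because we simulated it
population <- rnorm(10000, mean = 180, sd = 40)
beta0 <- mean(population)

# Compute b_0 (the sample mean) for each of 2,000 random samples of 185 homes
b0s <- replicate(2000, mean(sample(population, size = 185)))

beta0              # the parameter
mean(b0s)          # the average of the estimates, which should land close to beta0
mean(b0s > beta0)  # the proportion of estimates that came out too high
```

Any single \(b_0\) misses \(\beta_0\) by some amount, but the misses are not tilted in one direction: in this simulation, roughly half of the estimates land above the parameter and half below.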