Course Outline
-
segmentGetting Started (Don't Skip This Part)
-
segmentStatistics and Data Science: A Modeling Approach
-
segmentPART I: EXPLORING VARIATION
-
segmentChapter 1 - Welcome to Statistics: A Modeling Approach
-
segmentChapter 2 - Understanding Data
-
segmentChapter 3 - Examining Distributions
-
segmentChapter 4 - Explaining Variation
-
segmentPART II: MODELING VARIATION
-
segmentChapter 5 - A Simple Model
-
5.8 The World of Mathematical Notation
-
segmentChapter 6 - Quantifying Error
-
segmentChapter 7 - Adding an Explanatory Variable to the Model
-
segmentChapter 8 - Digging Deeper into Group Models
-
segmentChapter 9 - Models with a Quantitative Explanatory Variable
-
segmentPART III: EVALUATING MODELS
-
segmentChapter 10 - The Logic of Inference
-
segmentChapter 11 - Model Comparison with F
-
segmentChapter 12 - Parameter Estimation and Confidence Intervals
-
segmentFinishing Up (Don't Skip This Part!)
-
segmentResources
list High School / Advanced Statistics and Data Science I (ABC)
5.6 Venturing Into the World of Mathematical Notation
Up to now we have been representing our models with word equations: Thumb = Sex + Error; and so on. All of these follow the form: DATA = MODEL + ERROR. But writing out all these words will get tiresome as we write more models, which is why statisticians have developed mathematical notation to represent these relationships. We aren’t going to teach lots of notation in this course, but we are going to teach some. It’s a great way to impress your friends and employers, and also it’s a useful way to help you think about statistics. So, why not?
Representing the Mean
You probably learned how to calculate a mean (or average) in elementary school: take a group of numbers, add them together, and then divide by the number of numbers in your group. We could represent this calculation like this:
\[\text{mean} = \frac{\text{sum of all the numbers}}{\text{number of numbers}}\]
As you can see, in the expression above we used informal notation, similar to the word equations we have been using up to this point. Assuming you know that in a fraction the fraction bar can be read as “divided by” (such that a/b means “a divided by b”) then you can see clearly what the expression above represents: the sum of a group of numbers divided by the number of numbers. Simple.
If we rewrite the informal expression in mathematical notation we might write it like this—which we admit—looks a lot more complicated:
\[\bar{Y}=\frac{\sum_{i=1}^n Y_i}{n}\]
Take a deep breath, no need to panic. It’s really pretty simple. Let’s unpack this equation piece by piece:
We use the letter \(Y\) to represent the outcome variable, and we will do this throughout the course.
Putting the bar over \(Y\) (\(\bar{Y}\)) simply means the “mean of \(Y\).” This is pronounced “Y bar.”
The Greek capital letter \(\Sigma\) (pronounced “sigma”) means sum, and should be read as “the sum of…”
\(n\) is the letter used to represent the size of a sample.
And finally, we use the subscript \(i\) to represent each individual observation in our sample, starting from observation \(1\) and counting up to observation \(n\).
So, in the equation above, the expression to the right of the equal sign could be read as: the sum of all individual \(Y\)s from \(1\) to \(n\), divided by \(n\) (the number of observations).
This seems like a lot of notational baggage just to represent the mean of a sample. But, it has components that will become more and more useful as you work through the course. And, it resolves ambiguities that could come back to haunt us later if we just use informal word equations. For example, if we say “sum of all the numbers,” which numbers, exactly, are we referring to? The outcome variable? The explanatory variable? Using the notation \(Y\) makes this clear.
As a minor detour, let’s go back to one of the properties of the mean that we previously discovered using R: if you add up the deviations of each score in a distribution from the mean of the distribution, you will get 0. We could use notation to represent that idea like this:
\[\sum_{i=1}^n (Y_i-\bar{Y})=0\]
Two Important Points About Notation
There are two things you need to understand about mathematical notation—things that most students never fully appreciate but mathematicians know well. First, there is no single correct way to use mathematical notation. So when you pick up multiple textbooks in statistics, they often will have different formulas for things like the mean and the standard deviation, as well as for everything else. You need to be flexible and not get flustered if you see different expressions that basically mean the same thing.
Just as an example, many people would rewrite the equation for mean like this:
\[\bar{Y}=\frac{\sum{Y_i}}{n}\]
At some point, people get tired of writing the whole “i = 1 to n” part, and just agree to drop it because we all know what this means without it: sum up all the individual Ys and divide by n. This is all part of the living world of mathematical notation.
Because teachers want you to use notation consistently, they often will tell students to use the exact notation used in the particular textbook being used. But don’t get fooled by this: other books will write things differently. You need to be able to look at different versions of an expression or formula and see that they are really the same thing. (But yes, you also probably need to remember the one version your teacher expects you to use.)
The other thing about mathematical notation is that there are two different ways of interpreting it. Most students look at notation (e.g., the equation for the mean, above) as a step-by-step recipe for how to produce an answer. So, to find the mean, take all the numbers, add them up, and then divide by \(n\). But there’s a far more powerful way to look at mathematical notation: it is a representation of a quantity or relationship.
To take a simple example, the notation 2+3 could be seen as a recipe: take the number 2, add 3 more, and you’ll get the answer. But it also could be seen as a quantity in and of itself, the quantity (2+3).
You may not think it makes much difference, but what if we give you instead the expression \((x+3)\). If you see this as a recipe, and if you don’t know what \(x\) is, you are stuck. But if you look at it as representing the quantity \((x+3)\) then you don’t need to know the value of \(x\). You can still think of it as a quantity that is 3 greater than \(x\), regardless of whether you can compute it.
In statistics, if you think of notation in this second way, as a representation of quantities and relationships, you will end up learning and understanding more. So, when you see an expression that represents the mean, don’t think of it as a handy set of instructions for how to calculate the mean. And don’t start calculating the mean. You really don’t need to do that anyway; your computer will calculate the mean for you (think favstats()
), and you don’t need to tell it how!
Instead, think of it as a representation of an important relationship that defines what the mean is. When you see an equation, think about what it means, look for patterns, and so on.