CourseKata - 7.8 Measures of Effect Size

7.8 Measures of Effect Size

One question you might be asked about a statistical model is: How big is the effect? If you are modeling the effect of one variable on another (e.g., sex on thumb length), it’s natural for someone to ask this question.

This leads us to a discussion of effect size, and how to measure it. We haven’t used the term effect size up to now, but we have, in fact, presented two measures of effect size. Here we will review these two measures, and add a third.

Mean Difference (aka \(b_{1}\))

The most straightforward measure of effect size in the context of the two-group model is simply the actual difference in means between the two groups on the outcome variable.

In our data set Fingers, we can see that the size of the sex effect is 6.447 mm: males, on average, have thumbs that are 6.447 mm longer than females.

R: The `b1()` Function

The parameter estimates in the question above came from running lm(Thumb~Sex, data=Fingers). Another, more direct, way to find the \(b_{1}\) estimate (in this case the mean difference between the two groups) is to use the b1() function (part of the supernova R package). Run the code in code window below to see how this works.

require(coursekata)

# run this code
b1(Thumb ~ Sex, data = Fingers)

b1(Thumb ~ Sex, data = Fingers)

ex() %>% check_or(
    check_function(., "b1") %>%
    check_result() %>%
    check_equal,
override_solution(., 'b0(Thumb ~ Sex, data = Fingers)') %>%
check_function("b0") %>%
check_result() %>%
check_equal()
)

CK Code: ch7-24

Note that when you run the b1() function on the Sex model it returns 6.447, which is the difference in average thumb length between males and females. It also is the parameter estimate for \(b_{1}\). Try changing the b1 to b0 and run it again. This returns the estimate for \(b_{0}\), which in this case is the mean for females.

PRE

PRE is a second measure of effect size. As just discussed, it tells us the proportional reduction in error of the two-group model over the empty model. PRE is a nice measure of effect size because it is relative: it is a measure of improvement (reduction in error) that results from adding in the explanatory variable. But what counts as a good PRE?

Recall the Tiny_Sex_model had a PRE of .66 while the Sex_model had a PRE of .11. Are PREs in general going to be as large as Tiny_Sex_model? Probably not. In TinyFingers we stacked the deck for purposes of teaching—creating a data set in which all the females had smaller thumbs than all the males. This resulted in a large PRE.

As with every other statistic, PRE will vary from model to model and situation to situation. Having more experience with making models will give you a sense of what counts as an impressive PRE in your research area.

In the social sciences, at least, there are some generally agreed-on ideas about what is considered a strong effect. A PRE of .25 is considered a pretty large effect, .09 is considered medium, and .01 is considered small. So according to these conventions, there is a medium effect of sex on thumb length in the Fingers data set.

Take these conventions with a grain of salt, though, because effect size ultimately depends on your purpose. For example, if an online retailer found a small effect of changing the color of their “buy” button (e.g., PRE = .01), they might want to do it even though the effect is small. The change is free and easy to make and it might result in a tiny increase in sales.

R: The `PRE()` Function

Similar to the b1() function, the PRE() function will return the PRE for a model. In the code window below run supernova(Thumb~Sex, data=Fingers). Look in the supernova table to see what the PRE is for the Sex model.

require(coursekata)

# run this code
Sex_model <- lm(Thumb ~ Sex, data = Fingers)
supernova(Sex_model)

# now replace supernova with PRE and run again (then Submit)

PRE(Thumb ~ Sex, data = Fingers)

ex() %>% check_function("PRE") %>% check_result() %>% check_equal()

CK Code: ch7-25

Now change the code to run PRE(Thumb~Sex, data=Fingers). It should return just the PRE, the same one you found in the complete supernova table. Later, when we introduce the concept of sampling distributions, you will see how useful these new functions, such as b1() and PRE(), can be.

Cohen’s d

A third measure of effect size that applies especially to the two-group model (such as the Sex model) is Cohen’s d. Cohen’s d is related to the concept of z score. Recall that z scores tell us how far an individual score is from the mean of a distribution in standard deviation units. Cohen’s d, similarly, indicates the size of a group difference in standard deviation units.

\[d=\frac{\bar{Y}_{1}-\bar{Y}_{2}}{s}\]

Calculating Cohen’s d: What Standard Deviation to Use?

If you try to calculate Cohen’s d, you will soon encounter a question: Which standard deviation should you use? The standard deviation of the outcome variable for the whole sample, or the standard deviation of the outcome variable within the two groups? If the group means are far apart, the standard deviation for the whole sample might be considerably larger than the standard deviation within each of the groups.

In general we will use the standard deviation within groups when calculating Cohen’s d, because this is our best estimate of the variation in outcome that is not accounted for by our model. But because the standard deviations will often differ from group to group, we need a way of combining two standard deviations into one.

One possibility is just to average the two standard deviations. This works fine if the two groups are the same size. But imagine a situation where the sample size in one group is 100, but in the other only 10. If we average the two standard deviations we are assigning the same importance to the estimate of standard deviation based on the group of 10 as to the estimate based on the group of 100, which doesn’t seem right. We should have more confidence in the estimate based on the larger sample.

The solution is to use a weighted average of the two standard deviations, weighted by the number of degrees of freedom in each group. This is often referred to as the “pooled standard deviation,” and this is what we use to calculate Cohen’s d. The pooled standard deviation is calculated like this:

\[s_{pooled}=\frac{d\!f_1s_1+d\!f_2s_2}{d\!f_1+df_2}\]

(Another way to find the pooled standard deviation is to simply take the square root of the Mean Square Error on your supernova table. Mean Square Error, recall, is simply the Sum of Squares Error divided by the degrees of freedom for error, which yields a variance estimate based only on the variation within groups.)

We can improve the formula for Cohen’s d by replacing \(s\) with \(s_{pooled}\):

\[d=\frac{\bar{Y}_{1}-\bar{Y}_{2}}{s_{pooled}}\]

R: The `cohensD()` function

As with everything else in this class, there is an R function for calculating Cohen’s d.

cohensD(Thumb ~ Sex, data = Fingers)

Try running this code in the code window.

require(coursekata)

# run this code
cohensD(Thumb ~ Sex, data = Fingers)

cohensD(Thumb ~ Sex, data = Fingers)

ex() %>% check_function("cohensD") %>% check_result() %>% check_equal()

CK Code: ch7-15

[1] 0.7815688

We know that there is a 6.447 mm difference between male and female thumb lengths on average. If you think about one standard deviation as a unit of measurement, then you can see that the observed 6.447 mm difference is a little less than one of these standard-deviation units (.78 to be exact!).

With something like thumb length, knowing there is about a 6 mm difference is actually pretty meaningful. But for other variables such as Kargle and Spargle scores, people may not be as clear what a straight point difference implies.

Nevertheless, in both cases, it is somewhat illuminating to add the information from Cohen’s d to the mix. Male thumbs are .78 standard deviations longer than female thumbs.

Video Transcript

7.7 Comparing Two Models: Proportional Reduction in Error 7.9 Modeling the DGP