Statistics and Data Science: A Modeling Approach
2.1 From Numbers to Data
Although we can apply statistical techniques to any bunch of numbers, and sometimes we do just make up numbers to analyze (more about that later), we generally want to analyze numbers that represent something about the world. These numbers we refer to as data.
Measurement and Sampling
Data are the result of two fundamental processes in statistics: measurement and sampling. Measurement is the process by which we represent some attribute of an object with a number or place it into a category. For example, although no two people are alike, we can take one attribute of people, such as their height, and measure it (in inches, for example).
Two people of the exact same height may be different from each other in almost every other way. But still, we can attach a number to each of them that represents their height and say that, at least in terms of this attribute, they are the same. We all find this offensive at times—no one likes to be summarized by a number that represents a single attribute. But, we can all agree that it is sometimes useful to measure things, including people!
If the objects we study (sometimes called cases or research units) are people, then sampling is the process by which we choose which people to study, since we obviously can’t study all people. People would be one example of cases we could study, but our cases could be countries or families or mice—anything we can take a sample of and then measure in order to produce our data. Remember, we are analyzing data, and doing statistics, in order to understand variation. So, we will want to apply our measurements to more than one case—a sample of cases.
Where Data Come From
How we choose a sample of cases, and what measures we apply to that sample of cases, are decisions usually made with a specific research question in mind, though not always. We might measure the heights of a sample of people because we want to plan the height of a doorway, making sure that almost no one would bump their head walking through.
But sometimes, especially in this age of the internet, we find data that were collected for no particular purpose, or for someone else’s purpose. In that case we may think of a question after the fact that we could answer with the data. So sometimes people start with a question and try to find or collect data to help answer it, whereas other times they start with some data and go in search of a question or purpose that the data could address.
Two different researchers might collect the same measures but for different purposes. For example, a hospital and a dating website might both collect data on heights and weights (of patients and users, respectively). What might their purposes be?
Hospitals are interested in improving health and treating sick people. So, a hospital might track weight because it is related to other health measures, or because changes in weight might indicate an improving or worsening condition. A dating website might be interested in how people use these features to decide whether they want to interact with another person.
A hospital might actually measure height/weight objectively, using a scale, measuring tape, or ruler. But a dating website would most likely just ask users to self-report their height and weight. The hospital measurements might be a lot closer to the truth than the measurements found on a dating website!
The folks in these two samples might be very different. The hospital’s data typically come from people who are sick—after all, those are the people who are most likely to go to the hospital. They might also be older, because older people are also more likely to be sick. The dating website sample might be younger and healthier.
Organizing Data in a Data Frame
In statistics, we generally organize data into rows and columns. Here is a small example of a set of data organized this way.
Condition Age Wt Wt2 1 Uninformed 35 136 135.8 2 Uninformed 45 162 161.8 3 Informed 52 117 116.8 4 Informed 29 184 182.8 5 Uninformed 38 134 136.6 6 Informed 39 189 183.2
The rows represent the cases sampled. In this example, the rows represent housekeepers from different hotels. There are six rows, so six housekeepers are in this data set. Depending on the study, rows could be people, states, couples, mice—any cases you take a sample of in order to study.
The columns represent variables, or the attributes of each case that were measured. In this study, the housekeepers were either informed or not that their daily work of cleaning hotel rooms was equivalent to getting adequate exercise for good health. So one of the variables is Condition—whether they were informed of this fact or not.
There are other variables such as the housekeeper’s age (Age), weight before they started the study (Wt), and their weight at the end of the study (Wt2, measured four weeks later). Thus, the values across each row represent that particular case’s values on each of the variables measured.
Age is a single variable, which contains different values for different individuals. Here we see six values of Age for the six housekeepers in the sample. The variable Age is like a bucket; for each housekeeper, it can hold a different value.
There are four variables (Condition, Age, weight at beginning of study called Wt, and weight at end of study called Wt2) shown here in four columns.
The first column of numbers (notice it has no variable label) just numbers the rows.
Data in rows and columns are stored in R objects called data frames. Data frames include the rows and columns that contain the data. But they also include some other information, which could be thought of as metadata. The metadata includes, for example, the names of the variables (kind of like a header at the top of each column), and a row number. This means that the headers do not “count” as a row in the data frame, just as the row numbers do not “count” as a variable.