Statistics and Data Science: A Modeling Approach
We have started our journey with data: what we end up with after we turn variation in the world into numbers. The process of creating data starts with sampling, followed by measurement. We organize data into columns and rows, where the columns represent the variables (e.g., Thumb) that we have measured, and the rows represent the objects to which we applied our measurement (e.g., students). Each cell of the table holds a value, representing that row’s measurement for that variable (such as one student’s thumb length).
Before analyzing data, we often want to manipulate it in various ways. We may create summary variables, filter out missing data, and so on.
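To make these two manipulations concrete, here is a minimal sketch in plain Python. The data values and the variable names other than Thumb (Name, Index, FingerMean) are made up for illustration. Each row is one student, and a missing measurement is recorded as None.

```python
# Made-up data for illustration: each dict is one row (a student),
# and each key is a variable. None marks a missing measurement.
students = [
    {"Name": "A", "Thumb": 60.0, "Index": 70.0},
    {"Name": "B", "Thumb": None, "Index": 68.0},
    {"Name": "C", "Thumb": 58.0, "Index": 66.0},
    {"Name": "D", "Thumb": 65.0, "Index": None},
]

# Filter out missing data: keep only rows where both measurements exist.
# Create a summary variable: the mean of the two finger lengths.
complete = [
    dict(row, FingerMean=(row["Thumb"] + row["Index"]) / 2)
    for row in students
    if row["Thumb"] is not None and row["Index"] is not None
]

for row in complete:
    print(row["Name"], row["FingerMean"])
```

Only students A and C survive the filter, since B and D each have a missing value. Building new rows with dict(...) leaves the original table unchanged, so the raw data stays available if we later decide to handle missing values differently.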
But let’s keep our eye on the prize: we care about variation in data because we are interested in variation in the world. There is some greater population that a sample comes from. And here we see the ultimate problem with data: it won’t always look like the thing it came from. Much of statistics is devoted to understanding and dealing with this problem.
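A small simulation can illustrate how a sample may fail to look like its population. The sketch below is an assumption-laden example, not real data: it invents a population of 10,000 thumb lengths (drawn from a normal distribution with an arbitrary mean of 60 and standard deviation of 9), then draws five random samples of 20 and compares each sample mean with the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulation is repeatable

# Hypothetical population: 10,000 simulated thumb lengths.
# The mean (60) and standard deviation (9) are arbitrary choices.
population = [random.gauss(60, 9) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw five samples of 20 and record the mean of each one.
sample_means = []
for _ in range(5):
    sample = random.sample(population, 20)
    sample_means.append(statistics.mean(sample))

print("Population mean:", round(population_mean, 2))
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean:", round(m, 2))
```

Each sample comes from the same population, yet the sample means will generally differ from one another and from the population mean. Changing the sample size from 20 to something larger or smaller is a quick way to see how much that variation depends on how much data we collect.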