We can’t go back to the original population, but the sample that we already have should be a fairly good approximation to the overall population. We can therefore create additional samples (also of size n) by sampling with replacement from the original sample. For each of these “synthetic” samples, we can calculate the mean (or any other quantity, of course) and then use this set of values for the mean to determine a measure of the spread of its distribution via any standard method (e.g., we might calculate its inter-quartile range).
Let’s look at an example, one that is simple enough that we can work out the analytical answer and compare it directly to the bootstrap results. We draw n = 25 points from a standard Gaussian distribution (with mean μ = 0 and standard deviation σ = 1). We then ask about the (observed) sample mean and, more importantly, about its standard error. In this case, the answer is simple: we know that the standard error of the mean is σ/√n, which amounts to 1/5 here. This is the analytical result.
To find the bootstrap estimate for the standard error, we draw 100 samples, each containing n = 25 points, from our original sample of 25 points. Points are drawn randomly with replacement (so that each point can be selected multiple times). For each of these bootstrap samples, we calculate the mean. Now we ask: what is the spread of the distribution of these 100 bootstrap means?
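The procedure just described can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code behind the figure: the function name `bootstrap_means`, the seed, and the choice of the standard deviation as the measure of spread are all assumptions made for this example.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=100, rng=None):
    """Return the means of n_resamples bootstrap samples drawn from sample.

    Hypothetical helper for illustration. Each bootstrap sample has the
    same size as the original and is drawn with replacement.
    """
    rng = rng or random.Random()
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)  # sampling with replacement
        means.append(statistics.mean(resample))
    return means

rng = random.Random(12345)  # arbitrary seed, only for repeatability

# The "original sample": n = 25 points from a standard Gaussian.
sample = [rng.gauss(0.0, 1.0) for _ in range(25)]

# 100 bootstrap samples, each reduced to its mean.
means = bootstrap_means(sample, n_resamples=100, rng=rng)

# Spread of the bootstrap means, here measured by the standard deviation.
boot_se = statistics.stdev(means)

# Analytical standard error: sigma / sqrt(n) with sigma = 1, n = 25.
analytical_se = 1.0 / 25 ** 0.5

print(f"bootstrap estimate: {boot_se:.3f}")
print(f"analytical result:  {analytical_se:.3f}")
```

The two numbers should be of similar size but will not agree exactly: the bootstrap estimate depends on the particular 25 points drawn and fluctuates from one run to the next.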
The data is plotted in Figure 12-4. At the bottom, we see the 25 points of the original data sample; above that, we see the means calculated from the 100 bootstrap samples. (All points are jittered vertically to minimize overplotting.) In addition, the figure shows kernel density estimates of the original sample and also of the bootstrap means. The latter is the answer to our original question: if we repeatedly took samples from the original distribution, the sample means would be distributed similarly to the bootstrap means.
(Because in this case we happen to know the original distribution, we can also plot both it and the theoretical distribution of the mean, which happens to be Gaussian as well but with a reduced standard deviation of σ/√n. As we would expect, the theoretical distributions agree reasonably well with the kernel density estimates calculated from the data.)
Of course, in this example the bootstrap procedure was not necessary. It should be clear, however, that the bootstrap provides a simple method for obtaining confidence intervals even in situations where theoretical results are not available. For instance, if the original distribution had been highly skewed, then the Gaussian assumption would have been violated. Similarly, if we had wanted to calculate a more complicated quantity than the mean, analytical results might have been hard to obtain.
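To make this concrete, here is a hedged sketch of the same idea applied to a case where the analytical route is less convenient: the median of a sample drawn from a skewed (exponential) distribution. Everything specific here is an assumption chosen for illustration, including the helper names `bootstrap_statistic` and `percentile_interval`, the 1,000 resamples, the seed, and the use of a simple percentile interval as the confidence interval. Other ways of turning the bootstrap values into an interval exist.

```python
import random
import statistics

def bootstrap_statistic(sample, statistic, n_resamples, rng):
    """Apply statistic to n_resamples bootstrap samples of sample.

    Hypothetical helper: statistic is any function that maps a list of
    numbers to a single number (mean, median, ...).
    """
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, lower=0.025, upper=0.975):
    """Crude percentile interval: pick the values at the given ranks."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]

rng = random.Random(2024)  # arbitrary seed, only for repeatability

# A skewed "original sample": 25 points from an exponential distribution.
skewed_sample = [rng.expovariate(1.0) for _ in range(25)]

# Bootstrap the median instead of the mean.
medians = bootstrap_statistic(skewed_sample, statistics.median, 1000, rng)

# Two measures of spread: a 95 percent percentile interval and the
# inter-quartile range of the bootstrap medians.
low, high = percentile_interval(medians)
q1, _, q3 = statistics.quantiles(medians, n=4)

print(f"sample median:          {statistics.median(skewed_sample):.3f}")
print(f"95% percentile interval: [{low:.3f}, {high:.3f}]")
print(f"inter-quartile range:    {q3 - q1:.3f}")
```

Only the `statistic` argument changes between this sketch and a bootstrap of the mean, which is what makes the method attractive when no closed-form result is at hand.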
Let me repeat this, because it’s important: bootstrapping is a method to estimate the spread of some quantity. It is not a method to obtain “better” estimates of the original quantity itself—for that, it is necessary to obtain a larger sample by making additional drawings from the original population. The bootstrap is not a way to give the appearance of a larger sample size by reusing points!
Learn more about this topic from Data Analysis with Open Source Tools.