Learning From Small Samples

"It's better to be approximately right than to be exactly wrong."
-- Warren Buffett

How much data do you need to learn something useful? We associate learning something with knowing it completely, knowing it pat, having it down. But this is a heavy-handed way to think about what it is to know something. Once we shift our perspective and see learning something as reducing our uncertainty about that thing, we can go surprisingly far with surprisingly small amounts of information.

This way of thinking about learning is not an idea of mine but one that I read about in Douglas Hubbard's excellent book, How to Measure Anything. Hubbard's book also covers the trick for estimating medians that I outline here.

A Business Decision

Should you market your product in Russia? Your advisors tell you that you should, but only if the median population of the large cities in Russia is somewhere between 150,000 and 250,000. If the median is outside that range, then you shouldn't.

Even without calculating a thing, notice the way your advisors have set up your decision. Instead of giving you the traditional "spot value", they've given you a range of acceptable values: "Yes" if the median is inside the band and "No" if the median is outside the band.

Important Point: For any decision based on data, it pays to know up front whether the decision turns on a quantity falling within a range or on the quantity being above or below a given threshold. Keep this distinction between range-based decisions and threshold-based decisions in mind as you read on.

The Complete Data Set

I'm going to show you the complete data set right up front -- the populations of all 79 large cities in Russia as defined by the Wolfram Research database. Cross checking some of these numbers with those on Wikipedia, it appears that the ones below are out of date. But they will do fine for our purposes. We're going to use this data set as a reference in what follows.

Figure 1: Population of large Russian cities

If we look at how the populations of the 79 large cities in Russia are distributed, we see that they don't look anything like the bell curve we're so used to seeing.

Figure 2: Distribution of population in Russia's largest cities (the x axis is truncated)

Not all data points are shown in figure 2 -- Moscow has more than 10.5 million inhabitants and Saint Petersburg has more than 4.6 million inhabitants; they both fall beyond the right-hand edge of figure 2. We see that most of the large cities in Russia have populations of 500,000 or lower. In fact, the median population is 405,618.

The Available Information

Now that you've seen the full data, put it aside and pretend that you've forgotten what you've seen. We'll come back to it to check our predictions.

You need to make your business decision. The only information you have is a randomly selected group of 7 cities from the data set and their populations. Here they are:

{325,945; 413,068; 308,455; 509,010; 1,343,839; 1,271,045; 468,459}

Based on just these 7 values (less than 10% of the data set), what can you conclude about the median population of all the large cities in Russia?

Bold Conclusion: The median of the data set as a whole is somewhere between the lowest value 308,455 and the highest value 1,343,839.

This doesn't seem like much -- the spread between the two values that bracket the median is quite large. But nevertheless, peering behind the curtain at the full data set, we've managed to narrow the range down from what it used to be: 100,020 to 10,563,038. And we used less than 10% of the data set to arrive at this. If you think of the populations in the entire data set as points on a number line, then using just 7 data points we've been able to arrive at a factor of 10 reduction in our uncertainty!
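The factor of 10 can be checked with a few lines of arithmetic. The sketch below uses only the 7 sampled values and the two extremes of the full data set quoted above; the variable names are my own.

```python
# The 7 sampled city populations quoted in the text.
sample = [325_945, 413_068, 308_455, 509_010, 1_343_839, 1_271_045, 468_459]

# Smallest and largest populations in the full data set, as quoted in the text.
full_low, full_high = 100_020, 10_563_038

low, high = min(sample), max(sample)
full_width = full_high - full_low   # width of the range before sampling
sample_width = high - low           # width of the range spanned by the sample

print(f"sample range: {low:,} to {high:,}")
print(f"reduction factor: {full_width / sample_width:.1f}")  # prints 10.1
```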

The conclusion is bold. What makes it a good one? How well can it be defended?

By definition, the median is the value that divides the data set into two halves. One half contains all the cities whose populations are lower than the median and the other half contains all the cities whose populations are higher than the median. To keep things simple, if a city has a population that falls right on the median, we'll put it into the half whose populations are lower than the median.

Seen this way, randomly choosing 7 cities from our data set is like tossing a fair coin 7 times. We should expect roughly half our picks to be cities with populations below the median (Heads) and roughly half our picks to be cities with populations above the median (Tails).

Notice that this holds no matter what the distribution of the city populations turns out to be -- it can be Gaussian, it can be Log Normal, or whatever. A method that works no matter how the population is distributed is one to be celebrated indeed!

Here's a simulation of 7 tosses of a fair coin repeated 50,000 times.

Figure 3: Number of trials in which we get a total of 0, 1, 2, ..., 7 Heads in 7 tosses of a fair coin (total of 50,000 trials)

As you'd expect, the number of heads is either 3 or 4 in the vast majority of the trials. But the interesting bumps in figure 3 are not the ones in the middle but the ones at the two ends. According to figure 3, in 380 of 50,000 trials we got no heads at all; and in 378 of 50,000 trials we got all heads.
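A simulation like this takes only a few lines. What follows is a minimal sketch, not the code behind figure 3: the function name, the seed, and the use of Python's standard `random` module are my own choices, so the exact counts will differ from the 380 and 378 quoted above.

```python
import random
from collections import Counter

def simulate_heads(n_tosses=7, n_trials=50_000, seed=0):
    """Count how many trials produced 0, 1, ..., n_tosses heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter()
    for _ in range(n_trials):
        # Each toss is a head with probability 1/2.
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        counts[heads] += 1
    return counts

counts = simulate_heads()
for heads in range(8):
    print(heads, counts[heads])

# Trials with no heads at all or with all heads: the two end bumps.
one_sided = counts[0] + counts[7]
print(f"no heads or all heads: {one_sided} of 50,000 ({one_sided / 50_000:.2%})")
```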

Now why does that matter? The bold conclusion depends on having at least one Head and one Tail in our series of 7 tosses. But it looks like in 380 + 378 = 758 of 50,000 trials, we got either no Heads or all Heads. This means that in 758 of the 50,000 trials, our random sample resulted in populations that were either all above the median or all below the median. In these cases, the Bold Conclusion will be false. Specifically, about 758 / 50,000 = 1.52% of the time, the Bold Conclusion will be false. Or, flipping this around, the Bold Conclusion will be true 100 - 1.52 = 98.48% of the time, a little under 99%.

[Note: Yes, this can be easily worked out from the Binomial distribution. But I didn't do it this way because I prefer simulation to using a formula from a statistics book. More on this in another post.]
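For readers who do want the closed form, here is a small sketch of the Binomial calculation for 7 tosses of a fair coin. It uses exact fractions from Python's standard library. The result sits close to the simulated figures: each end bump should hold about 50,000 / 128, or roughly 391, trials.

```python
from fractions import Fraction
from math import comb

n = 7  # number of tosses (cities sampled)

# Probability of no heads plus probability of all heads for a fair coin.
p_one_sided = Fraction(comb(n, 0) + comb(n, n), 2**n)

# Probability of at least one head and at least one tail.
p_mixed = 1 - p_one_sided

print(p_one_sided, float(p_one_sided))  # 1/64 0.015625
print(p_mixed, float(p_mixed))          # 63/64 0.984375
```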

So we can amend the Bold Conclusion as follows:

Less Bold But Still Remarkable (and More Precise) Conclusion: With about 99% certainty, the median population of all the large cities in Russia is between 308,455 and 1,343,839.

The Remarkable and Precise Conclusion is quite remarkable. But what exactly does it mean to say that we're 99% certain that the median city population is between 308,455 and 1,343,839?

What Does it Mean to be 99% Certain?

It's pretty clear what it means to say that when I toss a fair coin 7 times I can be 99% certain that I'll get at least one Head and one Tail. We can just repeat the tosses as we did in figure 3 and calculate the proportion of the results. But we can't do this for our data set of the largest cities in Russia. There's only one such data set -- we can't say that if we took 100 such data sets then in 99 of them the median will be between 308,455 and 1,343,839. So how can we make sense of this seemingly more precise conclusion?

Given that we know the real answer (the median is 405,618), we can say that if we were to take 100 random samples of 7 cities from the data set, the range spanned by about 99 of those samples will contain the real median of the data set. This is easy to verify by simulating these trials and it is indeed true. And even if we didn't know the real answer (which we typically don't -- otherwise we wouldn't be trying to estimate it), the reasoning behind the coin tosses seems quite convincing.
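Here is one way such a check might look. The city data from figure 1 is not reproduced in this text, so the sketch below runs on a hypothetical stand-in: 79 made-up values drawn from a log-normal distribution. The function name `coverage`, the seeds, and the log-normal parameters are all my own assumptions. To check the real claim, pass the actual list of 79 populations instead.

```python
import random
import statistics

def coverage(population, sample_size=7, n_trials=20_000, seed=1):
    """Fraction of random samples whose min-to-max range contains the median."""
    rng = random.Random(seed)
    median = statistics.median(population)
    hits = 0
    for _ in range(n_trials):
        # Draw cities without replacement, as when picking 7 distinct cities.
        sample = rng.sample(population, sample_size)
        if min(sample) <= median <= max(sample):
            hits += 1
    return hits / n_trials

# Hypothetical stand-in population (NOT the real Russian city data).
rng = random.Random(42)
stand_in = [round(rng.lognormvariate(13, 0.8)) for _ in range(79)]

print(f"coverage on stand-in data: {coverage(stand_in):.2%}")
```

Because the coin-toss argument doesn't depend on the shape of the distribution, the stand-in should give a coverage close to the one for the real data.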

An Inconsistency?

But you might still have a nagging doubt. What if a friend also randomly sampled 7 cities? The range of numbers in your friend's sample may or may not overlap with the range you have. But by the same reasoning you used, your friend would also be 99% certain that the median was inside her range of values. We'll call your sample of 7 values A and your friend's sample of 7 values B. There are six ways in which your values may or may not overlap with your friend's values from her sample of 7 cities.

Figure 4: Ways in which ranges of values chosen by A and B may or may not overlap

In the first full overlap possibility (third box from left in figure 4), A's range is greater than B's range. A is 99% certain that his range contains the median; B is 99% certain that her range contains the median. If B is correct, then A is automatically correct; but if A is correct, it doesn't follow that B is correct in her statement of 99% certainty. So who is right? Can they both be right?

This problem becomes more acute in the partial overlap possibilities. It's possible for one person to be wrong. And when there's no overlap, as in the two rightmost boxes in figure 4, at least one person must be wrong.

How frequently do these possibilities occur? It's easy to simulate that and when we do, we find that when any two friends pick a random sample of 7 cities, the populations will partially overlap about 48.98% of the time and not overlap at all about 0.02% of the time. All of the overlap scenarios should make us question our understanding of the 99% certain clause in the Less Bold But Still Remarkable claim above.
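A simulation along these lines might be sketched as follows. The helper names `classify` and `overlap_rates` are my own, and the sketch folds the six cases of figure 4 into three (full, partial, none), since each case has a mirror image with A and B swapped. The percentages quoted above come from the real city data, which is not reproduced in this text, so any population you pass in that isn't the real list will give different numbers.

```python
import random

def classify(a, b):
    """Classify how the ranges of two samples relate: 'none', 'full' or 'partial'."""
    a_lo, a_hi = min(a), max(a)
    b_lo, b_hi = min(b), max(b)
    if a_hi < b_lo or b_hi < a_lo:
        return "none"     # the two ranges are disjoint
    if (a_lo <= b_lo and b_hi <= a_hi) or (b_lo <= a_lo and a_hi <= b_hi):
        return "full"     # one range sits entirely inside the other
    return "partial"      # the ranges share only part of their span

def overlap_rates(population, sample_size=7, n_trials=20_000, seed=2):
    """Estimate how often two independent samples fall into each overlap case."""
    rng = random.Random(seed)
    tally = {"none": 0, "full": 0, "partial": 0}
    for _ in range(n_trials):
        a = rng.sample(population, sample_size)
        b = rng.sample(population, sample_size)
        tally[classify(a, b)] += 1
    return {case: count / n_trials for case, count in tally.items()}

# Usage: rates = overlap_rates(populations), where populations is the list
# of 79 city populations (not included here).
```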

Now 0.02% is a small number (it's 2 hits per 10,000) but still -- the way we've been making sense of the 99% certainty is by putting it in terms of experiments we could in principle repeat, or in practice simulate. Having no overlap between two ranges is a guarantee that at least one person who is 99% certain according to the reasoning above is actually wrong! What gives?


Truth be told, I don't know what gives. When you scratch the surface of probabilistic reasoning, you regularly run into seeming inconsistencies and paradoxes. We saw something like this happen when we looked at the puzzle of rare events: sometimes there are interesting gaps between what the theory says is true and what might be true.

For the moment, we weren't able to make our decision about marketing our product in Russia, but even with a sample of just 7 cities we were able to learn something useful about the median population of the cities and something even more useful about how probabilities can get slippery quickly.