This is a discussion of what data doesn't do: that is, of the various ways in which measurement and interpretation themselves can transform data. This is not a treatise on "lies, damn lies, and statistics": we know that data can be used to purposefully confound; here, the focus is on how it can accidentally confuse. In particular:
Our tools for using data are inexact.
We process data with known biases.
Although the examples that follow focus on biomedical and financial data, they are more or less extensible. "Data" is used to mean any set of raw facts amassed from experience, observation, or experiment.
1. More Data Isn't Always Better
Statistics is a science of representation and approximation. The more of a system we capture or observe, the closer we can come to representing it honestly. An introductory statistics text will emphasize that as you increase sample size, you narrow the confidence interval without any loss in confidence. In other words, more data helps rein in your margin of error.
Figure 13.4. The normal distribution.
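The textbook relationship can be sketched in a few lines. This is a minimal illustration, assuming a normally distributed sample mean and a known standard deviation; the function name `margin_of_error`, the value of `sigma`, and the sample sizes are hypothetical choices made only for the example.

```python
import math

def margin_of_error(sigma, n, z=1.96):
    """Half-width of a normal-theory confidence interval for a mean.

    z=1.96 corresponds to roughly 95% confidence.
    """
    return z * sigma / math.sqrt(n)

# Hypothetical population standard deviation, chosen only for illustration.
sigma = 10.0
for n in (100, 400, 1600):
    print(n, round(margin_of_error(sigma, n), 2))
```

Under these assumptions the margin shrinks with the square root of the sample size, so quadrupling the sample only halves the margin of error.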
A fine truth for the textbooks. Outside of that gossamer world, several assumptions must be examined. First, how is your data distributed? Is it necessarily normal? In much of finance, for example, distributions eschew normality. Biomedical data (the expression of a trait, for example) is more frequently Gaussian, but evolution needn't always conform to the central limit theorem.
If the data is not normal, more data will not reduce your margin of error in the expected manner. Karl Popper described an asymmetry in how we use data to answer questions: while no number of results in support of a hypothesis will ever confirm it, a single contradictory result will disprove it. More data adds only marginal certainty, whereas one instance can dissolve a century of belief.
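The point about non-normal data can be sketched with a small simulation. The example below is an assumption-laden illustration, not a model of any real data set: it compares sample means drawn from a standard normal distribution with sample means drawn from a standard Cauchy distribution, a heavy-tailed case picked here because its sample mean is known not to settle down as the sample grows. The helper names, seed, and trial counts are all hypothetical.

```python
import math
import random
import statistics

def normal_draw(rng):
    return rng.gauss(0.0, 1.0)

def cauchy_draw(rng):
    # Inverse-CDF sampling for a standard Cauchy variate.
    return math.tan(math.pi * (rng.random() - 0.5))

def sample_mean(draw, n, rng):
    return sum(draw(rng) for _ in range(n)) / n

def iqr_of_means(draw, n, trials=200, seed=0):
    """Interquartile range of the sample mean across repeated samples."""
    rng = random.Random(seed)
    means = [sample_mean(draw, n, rng) for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

for n in (10, 1000):
    print(n, iqr_of_means(normal_draw, n), iqr_of_means(cauchy_draw, n))
```

In the normal case the spread of the sample mean tightens as n grows; in the Cauchy case it stays about as wide at n = 1000 as at n = 10, so collecting more data buys no extra precision.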
Second, is the cost of a false positive the same as that of a false negative? Even if your data is (or looks) normal, your interest in different outcomes might not be symmetrical. The cost of failing to detect a life-threatening illness may be greater, for example, than the cost of incorrect diagnosis. In such a case, data that improves the sensitivity of diagnosis (by cutting out false negatives) will be more useful than reams of data to winnow down false positives.
2. More Data Isn't Always Easy
Data doesn't necessarily scale. One of the trite maxims of our information age is that it's just as easy to process 10 bits as it is 10 terabytes, whereas 10 billion widgets are much more expensive to make than 10.
In some cases, the costs of cleaning and processing data are not trivial. This is particularly true when verification requires a human eye, such as reading meaning into X-rays or transcribing data coded in a questionnaire. In Red Queen fashion,[15] better computers and the ability to collect more and more data have driven (and been driven by) the development of new tools to parse it and new ways to use it.
There are also cognitive costs that accompany more information. Whether we're choosing between jams at the supermarket or 401(k) plans, research has shown that as the number of options increases, it takes us longer to decide, we become more likely to give up without choosing anything, and we are less satisfied with any choice we make (Iyengar and Lepper 2000).
Finally, a subtle cost: more data can begin to blind us to other possibilities, especially if we're responsible for its collection and collation. It's hard not to imagine that seeing more data means a hypothesis is better supported—a corollary of the confirmation bias and sampling issues discussed earlier.
3. Data Alone Doesn't Explain
People explain. Correlation and causality, you may have heard, make strange bedfellows. Given two variables correlated in a statistically significant way, causality can work forward, backward, in both directions, or not at all. Statisticians have made a hobby (not to mention a number of blogs) of chronicling the abuses of correlation, like old ladies clucking at the downfall of traditional values in the modern world.
Journalists are the preferred targets of such statistical "tsks." A recent article in the Wall Street Journal (Shellenbarger 2008), for example, suggested that because premarital cohabitation is correlated with higher rates of divorce, unwed couples could avoid living together in order to improve their chances of staying together after marriage. The research described never suggested a causal link, but the journalist offered her own advice to couples based on the "data."
The substitution of correlation with causality need not be so explicit. When a scientific research project is undertaken, there exists the assumption that correlation, if discovered, would imply causation, albeit unknown. Else, why seek to answer a research question at all? A large-scale search for correlation without causation is aleatory computation, not science. Even with so-called big data, science remains an intensely hypothesis-driven process.
The limits of empirical research are not grounds to throw up our hands, only to be careful to push discovery forward without getting rosy-eyed about causality. Creating stories about data is only human: it's the ability to revise consistently that makes a story sound.
4. Data Isn't Good for a Single Answer
Descriptive statistics can hide detail. The charts in Figure 13.5, for example, show four distributions that look dramatically different, yet share the same mean and variance. These two pillars of descriptive statistics—mean and variance—tell you very little about distribution (Anscombe 1973).
Figure 13.5. Anscombe's quartet: each data set has the same mean and variance.
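The quartet is small enough to check by hand. The sketch below uses the values as they are commonly reproduced from Anscombe (1973); they are transcribed here for illustration and should be checked against the paper before being relied on. The dictionary layout and variable names are assumptions of this example.

```python
import statistics

# x values shared by the first three data sets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    # Summary statistics agree closely; the extremes do not.
    print(
        name,
        "mean(y)=%.2f" % statistics.mean(ys),
        "var(y)=%.2f" % statistics.variance(ys),
        "min(y)=%.2f" % min(ys),
        "max(y)=%.2f" % max(ys),
    )
```

The means and sample variances of y agree to about two decimal places across the four sets, while the minima and maxima (and, when plotted, the shapes) differ markedly.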
When using data for decision-making, we tend to treat distributions as if they're good for one answer. We may need to base a binary decision—should the U.S. declare war? should the FDA approve this drug? who is predicted to win the election?—or a summary statement—how well-off are Americans? what will the Earth's climate look like in five years?—on data that's indeterminate. Even if variance is reported, the decision is what matters.
People think in terms of outcomes, not distributions. Consider a personal financial decision: how much should I invest in stocks, bonds, and cash? Even if past financial performance properly predicted future returns (which, as even the financial advisors are legally required to admit, it doesn't)—that is, even if we knew the shape of the distribution—we'd still have a number of risk and reward pairs from which to elect, and a number of possible outcomes within those distributions. With a given risk level, one's retirement could be characterized by abundance or by poverty, and it's difficult to imagine these several futures concurrently (one tends to suppose the average, or sometimes the best-case scenario—the so-called "planning fallacy").
A team of decision scientists has created an interesting tool to help investors understand the range of possibilities inherent in a distribution of outcomes (see Figure 13.6). Participants can adjust 100 "probability units" to form a distribution curve. For example, they might place all of their units at 75% of salary, or distribute them evenly among a variety of percentage levels. Then, they press go and watch as the units, one by one, disappear at random. The last one standing is the "outcome" (Goldstein et al. 2008). Thus, a level of risk is not an ambiguous distribution curve but a set of (here, 100) equally probable possibilities.
Figure 13.6. A tool by Goldstein et al. helps people understand a distribution as a set of outcomes.
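The mechanics of such a tool can be sketched as a simulation. This is not the implementation by Goldstein et al., whose details are not given here; it is a minimal stand-in that assumes 100 units are removed uniformly at random until one remains. The function name `last_unit_standing` and the allocation of units are hypothetical.

```python
import random
from collections import Counter

def last_unit_standing(units, rng):
    """Remove units at random, one by one, and return the survivor's outcome.

    `units` maps an outcome (here, % of salary) to a count of probability units.
    """
    pool = [outcome for outcome, count in units.items() for _ in range(count)]
    rng.shuffle(pool)
    while len(pool) > 1:
        pool.pop()  # one more unit disappears
    return pool[0]

# Hypothetical allocation of 100 units across retirement-income levels.
allocation = {50: 20, 75: 50, 100: 30}

rng = random.Random(1)
tally = Counter(last_unit_standing(allocation, rng) for _ in range(10_000))
print(tally)
```

Each run yields one concrete outcome rather than a curve. Over many runs the outcomes appear in roughly the proportions allocated, which is the sense in which the distribution is a set of equally probable possibilities.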
Biologist Stephen Jay Gould further illuminates the problem with equating descriptive statistics with outcomes. "The Median Isn't the Message" is Gould's reaction to a diagnosis of cancer and the warning that he had "eight months to live." Literature on the cancer revealed a right-skewed distribution based on a "prescribed set of circumstances": that is, a long tail of long-lived survivors, under the assumption of past treatment conditions. To claim "eight months" was to miss the bulk of the picture. As Gould elegantly characterizes the brutishness of statistics:
[E]volutionary biologists know that variation itself is nature's only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions.
5. Data Doesn't Predict
Building models (to forecast tomorrow's weather, the outcome of the 2012 Super Bowl, or the fate of the Fortune 500) is a seductive art. Indeed, an important extension of science's prime venture—to explain the world around us—is to try to understand the world as it will be.
In certain domains, namely in controlled microcosms of the physical world, it is possible to predict an outcome with near certainty. Future results will track past events with high fidelity: water will turn to gas when heated; a falling object will accelerate at 9.8 meters per second squared in a vacuum; if a creature's heart stops, it's dead. Epistemologically, Popper's notion of falsifiability is never moot, but it's possible to lead a sound and social life taking the previous three assumptions to be axiomatic.
In domains with less certainty, such as human or physical behavior, modeling is an important tool to help explain patterns. In our eagerness to make data say something, however, it's possible to overfit a model.
Consider the problem of finding exoplanets with Doppler radial velocity (a process I don't pretend to understand more than superficially: basically, bright stars make it hard to see planets, so astronomers identify combinations of Doppler shifts that would only occur due to the presence of a planet orbiting the star). It's difficult to test the sensitivity of a model, but with a mere 15 observations, it's possible to fit the very sexy sinusoidal curve in Figure 13.7 to the data (Ge et al. 2004)!
When we overfit a model, it loses predictive power. Also, if we're willing to accept any model that best fits existing data, without a care for its complexity or sensitivity, we make several mistakes. First, we forget causality and do data a disservice; an over-tuned model explains nothing.
Second, we forget that data (or data collection) may be limited, and that the world itself can change. Take the problem of trying to predict the world's climate 200 years from now. There are a few key pieces of evidence at high resolution over a long course of time, namely global temperature data from the fossil record and ice cores. Climatologists can also infer local temperature and precipitation from diaries and tree rings, but with very different levels of precision: 18th-century storm glasses are not the same as 20th-century weather balloons with GPS. And who knows whether the same sets of interactions will drive climatic events in the 21st century as they did in the 20th.
Figure 13.7. A model for extra-solar planet identification.
Similarly, Ford Motor Company in 1914 is not the same as Ford in 1975 nor as Ford today, yet many financial models assume the dynamics of the market's last cycle will also explain its future performance (and models make very different assumptions about the relevant time period to consider). As a result, risk analysis models may be only a little bit (acceptably) off on most days, but can break down entirely when the "unexpected" occurs (for example, when the housing market collapses).
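The loss of predictive power from overfitting can be sketched with 15 noisy observations. The example below is an illustration under stated assumptions, not a reconstruction of the exoplanet or financial models discussed here: it swaps in a polynomial that passes through every training point (Lagrange interpolation) as the overfit model and a straight line as the simple one. The underlying `truth` function, the noise level, and the seed are hypothetical.

```python
import random

def linear_fit(xs, ys):
    """Ordinary least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def interpolating_fit(xs, ys):
    """Degree n-1 Lagrange polynomial that hits every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def rmse(model, xs, ys):
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def truth(x):
    # Hypothetical underlying relationship, assumed only for this sketch.
    return 2.0 + 0.5 * x

rng = random.Random(42)
train_x = [float(i) for i in range(15)]
train_y = [truth(x) + rng.gauss(0.0, 1.0) for x in train_x]
# Held-out points halfway between the training points, without noise.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [truth(x) for x in test_x]

line = linear_fit(train_x, train_y)
wiggle = interpolating_fit(train_x, train_y)
print("train RMSE:", rmse(line, train_x, train_y), rmse(wiggle, train_x, train_y))
print("test RMSE: ", rmse(line, test_x, test_y), rmse(wiggle, test_x, test_y))
```

The interpolating polynomial fits the 15 observations perfectly, yet it swings widely between them, so its error on the held-out points is far larger than that of the plain line.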
Good scientists are aware of the dangers of a bad model, but it's not hard to be seduced by a fit that's too good to be true. Take, for example, this 2005 report from Moody's (Dwyer) discussing one unit's experience with overfit models (which it proceeded to correct, but—in hindsight—in a not nearly big enough way):
A certain amount of skepticism is appropriate when a new modeling methodology yields a large increase in power. [It] will often be the result of fitting the data collection mechanism rather than an actual underlying behavioral relationship. In addition, these issues are clear afterward, but were not turned up in the ordinary pre-modeling data-cleansing process that we had in place at the time…
At best, overfitting will introduce unneeded complexity into the model or discredit it with users. At worst, it can lead to systematic error in the risk assessment of a portfolio.
Oops.
There are many compelling reasons to build models beyond prediction, including to explore scenarios and illuminate assumptions; for an excellent enumeration, see Joshua Epstein's 2008 essay "Why Model?"
6. Probability Isn't Intuitive
This is another favorite flogging horse of the statistical establishment, and for good cause. Statisticians tirelessly devise cute games to demonstrate that a seemingly common-sense answer can fail to be probabilistically correct, and that conditional and joint probabilities are not intuitive. They are especially delighted when mathematicians and medical doctors are fooled by these games.
In a given U.S. city, about 1,000 out of 1 million (or 0.1% of) inhabitants are HIV-positive.
A new test to diagnose HIV has a 1% failure rate: one out of a hundred times, it will incorrectly diagnose an HIV-negative individual as having HIV, and 1% of the time it will incorrectly diagnose someone with HIV as HIV-negative.
Suppose an individual takes the test and is diagnosed as HIV-positive. What are the chances that he has the virus?
Many people will answer that there's a 99% chance he has the virus, because the test has a 1% failure rate. In fact, because the proportion of the population that has the disease is so small, any individual's chance of having it, even if diagnosed, is low: only about 9%. (Of the 999,000 HIV-negative residents, 9,990 will be told they have the virus, while 990 of the 1,000 HIV-positive residents will get a true positive. Given a positive diagnosis, the chances that you are in fact HIV-positive are 990/(990 + 9,990) = 990/10,980, or about 9.0%.)
Doctors, at least apocryphally, fail in droves.
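Working the example through Bayes' rule makes the role of the base rate explicit. The sketch below uses only the figures given in the example (1,000 infected residents out of 1 million, and a 1% error rate in each direction); the function and variable names are hypothetical. The key step is that the denominator counts every positive result, true and false alike.

```python
def posterior_given_positive(prevalence, false_positive_rate, false_negative_rate):
    """P(has the virus | positive test), by Bayes' rule."""
    true_positives = prevalence * (1 - false_negative_rate)
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

population = 1_000_000
infected = 1_000
error_rate = 0.01

# The same calculation expressed as head counts in the city.
true_positive_count = infected * (1 - error_rate)            # 990
false_positive_count = (population - infected) * error_rate  # 9,990

p = posterior_given_positive(infected / population, error_rate, error_rate)
print("true positives: ", round(true_positive_count))
print("false positives:", round(false_positive_count))
print("P(infected | positive) = %.1f%%" % (100 * p))
```

False positives outnumber true positives roughly ten to one, so a positive result leaves the chance of infection at about 9%, far from the intuitive 99%.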
In many situations, priors don't disappear. When using data to answer a question, we don't know what evidence to exclude and how to weight what we include. Daniel Kahneman—an indefatigable namer-of-concepts—names this one the "base rate fallacy."
7. Probabilities Aren't Intuitive
Not only is probability theory difficult to grasp, individual probabilities are fleeting. In the absence of a causal explanation to tie an event to a set of outcomes, individuals rely on past observations to estimate probabilities. And observations are often collected in a biased way (especially if they're garnered through experience, but often also when collected via experimentation), and are very difficult to document, reconcile, weight, preserve, and query.
8. The Real World Doesn't Create Random Variables
In the beginning, the earth was without form and void. Then Fisher said, "Let there be z-scores and ANOVA" and there were z-scores. And Fisher saw that regression was good, and he separated the significant from the nonsignificant.
The innovations of statistics seem so momentous that it can be difficult to keep in mind that they're not laws of nature. One can imagine an alternate universe in which the de facto threshold for statistical significance had been set (arbitrarily, as it was in our universe) at p=0.01 or p=0.06, rather than at the current p=0.05. Think of the drugs that would have been approved or rejected, the misplaced correlations between environmental variables and health effects, the piles of cash you'd be saving on auto insurance!
In our non-Fisherian world, there are no such things as independent random variables. In fact, many things are highly connected. Good experimentation controls for interdependencies insofar as is possible, but dependencies can be hard to spot. As we've learned recently, it can be a mistake to assume discrete events (a homeowner defaulting on his mortgage, for example) are independent, and to build large edifices upon such assumptions (tradeable financial products sliced into tranches, for example) when they are not necessarily so.
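The cost of wrongly assuming independence can be sketched with a toy pool of loans. Every number below is hypothetical: a pool of 100 loans, a 5% chance that any one loan defaults, and a "bad state" of the world that occurs 10% of the time and raises each loan's default probability to 30%. The good-state probability is set so that each loan's overall default rate is still 5% in both models.

```python
from math import comb

def tail_prob(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_loans = 100
threshold = 20      # "catastrophe": at least 20 of 100 loans default
p_default = 0.05    # hypothetical per-loan default probability

# Model 1: defaults are independent.
independent = tail_prob(n_loans, p_default, threshold)

# Model 2: a shared shock makes defaults dependent (all values hypothetical).
p_bad_state = 0.10
p_default_bad = 0.30
# Chosen so the overall per-loan default probability is still p_default.
p_default_good = (p_default - p_bad_state * p_default_bad) / (1 - p_bad_state)
dependent = (
    p_bad_state * tail_prob(n_loans, p_default_bad, threshold)
    + (1 - p_bad_state) * tail_prob(n_loans, p_default_good, threshold)
)

print("P(>= 20 defaults), independent:", independent)
print("P(>= 20 defaults), common shock:", dependent)
```

Each loan looks equally risky in both models, but the chance of a catastrophic cluster of defaults is vanishingly small under independence and on the order of one in ten with the shared shock.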
Prediction markets and group decision-making processes can work exceptionally well—in some cases, better than the estimates of a set of experts. They've been shown to break down, however, when information cascades and interdependencies enter the system (Bikhchandani et al. 1998).
9. Data Doesn't Stand Alone
In real-world decision-making, data comes in many forms. Rarely is information cleaned and packaged in a well-labeled spreadsheet or matrix file; instead, we often need to make conclusions based on subjective as well as quantitative information.
Take, for example, the decision of whether to lend money to someone online (for a profit, as part of an established lending marketplace). An analysis that colleagues and I conducted of loan funding and repayment using a data set of 350,000 loans from the peer-to-peer platform Prosper.com reveals that any number of models (mixture models, neural nets, decision trees, regression) can predict who will get a loan and who will repay it on time with only about 75% accuracy. A huge amount of data—including over 100 personal financial health indicators for each member of the network—can be fed into the algorithms, but there remains uncertainty about which applicants will fare well and which will fail to be funded.
Our models can be refined in part by attempting to quantify subjective features. When an individual decides whether to lend money to a member of the network, the lender (unlike a bank) takes into account a number of "softer" factors: the borrower's statement of purpose, the accompanying image, spelling, grammar, and other profile information. To incorporate some of these features into our models, I used human workers (from Amazon's Mechanical Turk) to code images from Prosper.com members, first for content—whether the image depicts a person, a family, a vehicle, etc.—and then for a "trustworthiness" score: that is, for the answer to the question, "Would you lend money to this person?"
But the models still fell short: social factors play into loan dynamics in unexpected ways. Contrary to our assumptions, lending decisions aren't made independently. Rather, there's some evidence of herding behavior in bids: lenders follow other lenders, and bids-per-unit-time accelerate as more bids accumulate on a loan.
Even with these and other social factors taken into account, many lenders make suboptimal decisions. Prosper is, in theory, a market with near-perfect information: almost anyone can access the site's API and repeat our analysis. Yet lenders continually accept a low level of return for very risky investments: a surprising number make very bad bets given statistical expected payoffs. Even with good information (and modest proxies for subjective data), decisions aren't always made directly from data, and data, in turn, can only explain human decisions in part.
10. Data Isn't Free from the Eye of the Beholder
Finally, even in realms where solid causal explanation is possible, when data is collected honestly and modeled carefully by a judicious student of Fisher and (if our pupil is so inclined) Bayes, who accounts for variation and validates his model (and still remains skeptical of its results), a couple of cognitive biases cloud our thinking. In the real world, we operate pseudoprobabilistically at best.
Just as the statisticians tend to their tsk-tsk blogs, the behavioral economists have made a field from their own chronicles of infamy. The narrative fallacy, confirmation bias, paradox of choice, asymmetry of risk-taking, base rate fallacy, and hyperbolic discounting were mentioned earlier. Psychologists have indexed many others, ranging from anchoring (overreliance on a single recent data point in making a decision) to the Lake Wobegon effect (the phenomenon of more than half of individuals in a population believing they are above average).
As these effects become better documented, we can develop tools and intuitions to help take data at face value (part of my work is focused on developing tools for financial decision-making). In some sense, the solution is simple: data doesn't do much if you don't understand its limits.
[15] Lewis Carroll's Red Queen, from Through the Looking-Glass, proclaims, "It takes all the running you can do, to keep in the same place." This idea has been used to describe a system that, due to an arms race of external pressures, must continue to co-evolve.



