Data Analysis: Understanding Simpson's Paradox

Posted by janert, Jan 31 2011 12:20 AM

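Look at the table below, which shows applications and admissions to a fictional college, broken down by the applicants' gender and department. Each cell gives the number admitted over the number who applied, followed by the resulting acceptance rate.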

Department   Male            Female          Overall
A            80/100 = 0.8    9/10 = 0.9      89/110 = 0.81
B            5/10 = 0.5      60/100 = 0.6    65/110 = 0.59
Total        85/110 = 0.77   69/110 = 0.63


If you look only at the bottom line with the totals, then it might appear that the college is discriminating against women, since the acceptance rate for male applicants is higher than that for female applicants (0.77 versus 0.63). But when you look at the rates for each individual department within the college, it turns out that women have a higher acceptance rate than men in every department. How can that be?

The short and intuitive answer is that many more women apply to department B, which has a lower overall admission rate than department A (0.59 versus 0.81), and this drags down their (gender-specific) acceptance rate.
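The arithmetic is easy to check. The following is a minimal sketch that recomputes the rates from the counts in the table above; the names `counts`, `department_rate`, and `total_rate` are made up for this illustration and are not taken from the book.

```python
# (admitted, applied) counts copied from the table above
counts = {
    "A": {"male": (80, 100), "female": (9, 10)},
    "B": {"male": (5, 10), "female": (60, 100)},
}

def department_rate(counts, dept, gender):
    """Acceptance rate for one gender within one department."""
    admitted, applied = counts[dept][gender]
    return admitted / applied

def total_rate(counts, gender):
    """Acceptance rate for one gender with all departments lumped together."""
    admitted = sum(by_gender[gender][0] for by_gender in counts.values())
    applied = sum(by_gender[gender][1] for by_gender in counts.values())
    return admitted / applied

# Within each department, compare the two rates side by side.
for dept in counts:
    m = department_rate(counts, dept, "male")
    f = department_rate(counts, dept, "female")
    print(f"Department {dept}: male {m:.2f}, female {f:.2f}")

# Then compare the rates after pooling both departments.
print(f"Total: male {total_rate(counts, 'male'):.2f}, "
      f"female {total_rate(counts, 'female'):.2f}")
```

Women come out ahead in each per-department line, yet men come out ahead in the pooled line, because the pooled female rate is dominated by the 100 applications to department B.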

The more general explanation speaks of a "reversal of association due to a confounding factor." When considering only the totals, it may seem as if there is an association between gender and admission rates, with male applicants being accepted more frequently. However, this view ignores the presence of a hidden but important factor: the choice of department. In fact, the choice of department has a greater influence on the acceptance rate than the original explanatory variable (gender). By lumping the observations for the different departments into a single number, we have masked the influence of this factor, with the consequence that the association between gender and acceptance rate (which favors women within each department) was reversed.

The important insight here is that such a “reversal of association” due to a confounding factor is always possible. However, two conditions must both hold: the confounding factor must be sufficiently strong (in our case, the acceptance rates for departments A and B were sufficiently different), and the assignment of experimental units to the levels of this factor must be sufficiently imbalanced (in our case, many more women applied to department B than to department A, while the opposite was true for men).
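One way to see the role of the imbalance is to note that each pooled rate is a weighted average of the department rates, weighted by where the applicants applied. The sketch below keeps the per-department rates from the example but swaps in a hypothetical, evenly split set of applications (55 to each department for both genders). That balanced scenario is an assumption made purely for illustration and does not appear in the original data, and the `pooled_rate` helper is likewise invented here.

```python
# Per-department acceptance rates, taken from the example's table
male_rates = {"A": 0.8, "B": 0.5}
female_rates = {"A": 0.9, "B": 0.6}

def pooled_rate(rates, applicants):
    """Weighted average of department rates, weighted by applicant counts."""
    total = sum(applicants.values())
    return sum(rates[dept] * applicants[dept] / total for dept in rates)

# Observed applications: men mostly apply to A, women mostly to B.
observed_male = pooled_rate(male_rates, {"A": 100, "B": 10})
observed_female = pooled_rate(female_rates, {"A": 10, "B": 100})

# Hypothetical (not in the source data): both genders split evenly.
balanced = {"A": 55, "B": 55}
balanced_male = pooled_rate(male_rates, balanced)
balanced_female = pooled_rate(female_rates, balanced)

print(f"Observed: male {observed_male:.2f}, female {observed_female:.2f}")
print(f"Balanced: male {balanced_male:.2f}, female {balanced_female:.2f}")
```

With the observed applications the pooled rates favor men, while with the balanced applications they favor women, matching the per-department picture. The department rates are identical in both cases; only the weights change.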

Unlike Bigfoot, Simpson’s paradox is known to occur in the real world. The example in this section, for instance, was based on a well-publicized case involving the University of California (Berkeley) in the early 1970s. A quick Internet search will turn up additional examples.

Learn more about this topic from Data Analysis with Open Source Tools. 
