How to Keep Data Analysis Simple

Posted Nov 22 2010 01:56 PM

By Philipp Janert, author of Data Analysis with Open Source Tools (O'Reilly Media; $39.99 USD)
More data analysis efforts seem to go bad because of an excess of sophistication rather than a lack of it.
This may come as a surprise, but it has been my experience again and again. As a consultant, I am often called in when the initial project team has already gotten stuck. Rarely (if ever) does the problem turn out to be that the team did not have the required skills. On the contrary, I usually find that they tried to do something unnecessarily complicated and are now struggling with the consequences of their own invention!

Based on what I have seen, two particular risk areas stand out:

  • The use of "statistical" concepts that are only partially understood--and given the relative obscurity of most of statistics, this includes virtually all statistical concepts; and

  • Complicated--and expensive--black-box solutions when a simple and transparent approach would have worked at least as well or better.

I strongly recommend that you make it a habit to avoid all statistical language. Keep it simple and stick to what you know for sure. There is absolutely nothing wrong with speaking of the "range over which points spread," because this phrase means exactly what it says: the range over which points spread, and only that! Once we start talking about "standard deviations," this clarity is gone. Are we still talking about the observed width of the distribution? Or are we talking about one specific measure for this width? (The standard deviation is only one of several that are available.) Are we already making an implicit assumption about the nature of the distribution? (The standard deviation is only suitable under certain conditions, which are often not fulfilled in practice.) Or are we even confusing the predictions we could make if these assumptions were true with the actual data? (The moment someone talks about "95 percent anything" we know it's the latter!)
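To make the distinction concrete, here is a minimal Python sketch using only the standard library. The data set is made up for illustration (hypothetical response times with a single outlier), and the function name `spread_summary` is my own invention. It reports the plain observed range next to two of the "several available" measures of width, and then counts how many points actually lie within two standard deviations of the mean, rather than assuming the figure a bell curve would predict.

```python
import statistics

# Hypothetical response times in milliseconds (made-up numbers for
# illustration): most values cluster near 100, one outlier sits far away.
data = [95, 98, 99, 100, 100, 101, 102, 103, 105, 400]


def spread_summary(values):
    """Return several different answers to 'how widely do the points spread?'"""
    ordered = sorted(values)
    # Quartile cut points; the interquartile range is q3 - q1.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    mean = statistics.mean(ordered)
    sd = statistics.stdev(ordered)  # sample standard deviation
    # Count the points that really fall within two standard deviations
    # of the mean, instead of assuming what a Gaussian would predict.
    inside = sum(1 for v in ordered if abs(v - mean) <= 2 * sd)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "stdev": sd,
        "fraction_within_2sd": inside / len(ordered),
    }


summary = spread_summary(data)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With this particular data, the observed range states exactly where the points lie, while the standard deviation and the interquartile range give very different impressions of the "width" because the single outlier dominates one and barely touches the other. Which number is appropriate depends on the question being asked, which is precisely why the plain description is the safer starting point.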

Don’t discard simple methods until they have been proven insufficient. Simple solutions are frequently rather effective: the marginal benefit that more complicated methods can deliver is often quite small and may be in no reasonable relation to the increased cost. More importantly, simple methods have fewer opportunities to go wrong or to obscure the obvious.

I think the primary reason for this tendency to make data analysis projects more complicated than they need to be is discomfort: discomfort with an unfamiliar problem space and uncertainty about how to proceed. This discomfort and uncertainty create a desire to bring in the "big guns": fancy terminology, heavy machinery, large projects. In reality, of course, the big guns have the opposite of the intended effect: the complexities of the "solution" overwhelm the original problem, and nothing gets accomplished.

Data analysis does not have to be all that hard. Although there are situations when elementary methods will no longer be sufficient, they are much less prevalent than you might expect. In the vast majority of cases, curiosity and a healthy dose of common sense will serve you well.

The attitude that I am trying to convey can be summarized in a few points:

  • Simple is better than complex.
  • Cheap is better than expensive.
  • Explicit is better than opaque.
  • Purpose is more important than process.
  • Insight is more important than precision.
  • Understanding is more important than technique.
  • Think more, work less.

Although I do acknowledge that the items on the right are necessary at times, I give preference to those on the left whenever possible.

