Jump to content

When to use Excel, when to use R?

+ 1
  michaelmilton's Photo
Posted Jan 26 2010 12:17 PM

Because I wrote a book on data analysis and am currently finishing a book on Excel and a introductory video series on R for O’Reilly (I’ll tweet the link when I have one), I get asked this question pretty frequently. Generally the person asking this question knows enough Excel to get by and has never really looked at R.

To this person I’d strongly recommend that you extend your Excel repertoire beyond list-making, highlighting, and the occasional SUM formula. My experience showing people how to use Excel suggests that for every Excel power user there are a hundred people using Excel in a minimal way. Excel has plenty of firepower, and you paid for it, so why not learn it?

You should also learn R (more specifically, the S language). It’s not the easiest language to pick up, but once you get a hang of the basic data structures, the data analytic floodgates open. Since I started using R over a year ago, data analysis has become a much more exciting, “bwa-ha-ha!”-inducing experience than it ever was before. The power of R is astonishing. I haven’t been this excited about doing data work since I printed out SuperCalc bar charts on my Epson 9-pin dot matrix printer. If you haven’t already, you should learn R.

Ok, so let’s look at the specific question. Here are a few suggestions.

When to use Excel

When you have something that needs a nice presentation. Most people use Excel as a page layout program for quantitative or list-based data. Seriously, as a page layout program, like InDesign or something.

This is probably not the primary usage that the authors of VisiCalc had in mind, but it’s a big need that people have, and the accretion of new formatting features in Excel 2007 and 2010 shows that the good folks at Microsoft recognize that their job is to give people what they want. Excel is a fast and straightforward tool for the presentation of tabular information. It would be the first tool I’d use if I wanted to present a summary of data, and I’ve even dropped graphics created in R into an Excel spreadsheet (which I then ripped to a PDF) when I wanted to create a nice presentation.

Now, people can use Excel for data presentation either well or poorly. I know that the Head First-approved Non-Designer’s Design Book was a big help for me when it comes to recognizing when I’m creating something hideous in Excel. I recommend it.

When you have quick and dirty number crunching to do.
With Excel, loading data and writing formulas is quick and easy. With R, there’s generally some configuration overhead you have to endure in order to start crunching numbers. If you need to do a small handful of descriptive stats on your data, or you need to look something up, run a quick sort/filter, or even a pivot table, Excel is the tool.

Some people never need to go beyond this sort of data work. They probably don’t need to learn R, even though I’m inclined to say that everyone needs to learn R.

When I want to eyeball data quickly and maybe run a few basic formulas, I’m happy to fire up Excel instead of R.

When to use R

When you have to explore data. At the start of an analytic project, it’s a good idea to create a bunch of graphical visualizations of your data to get a sense of what’s inside it. In terms of its graphical capabilities, R exists in a whole separate dimension from Excel. This was perhaps the most shocking part to me about using R for the first time: I really thought I had a handle on data analysis even though I’d restricted my software to Excel, but boy was I wrong. The visualizations you can create in R are much more sophisticated and much more nuanced. And, philosophically, you can tell that the visualization tools in R were created by people more interested in good thinking about data than about beautiful presentation. (The result, ironically, is a much more beautiful presentation, IMHO.)

Here’s how I’d put the difference to someone who’s familiar with Excel but not yet with R. The graphics creation options that Excel gives you are all based in the graphical user interface. This is what makes Excel relatively easy to use—all your options are laid out before you with nice buttons and fill-in-the-blank boxes. But in order to create a graphical interface that’s easy to use, the creators of Excel had to make a bunch of decisions about what sorts of graphics you are and are not likely to want. With too many choices, the graphical interface becomes cumbersome and frustrating, so to achieve simplicity they had to eliminate options.

And this isn’t a gripe or anything. I can’t say I’d have done a better job designing Excel’s charting graphical interface. I cut my teeth on it.

These limitations become a problem when you want to inspect data visually in a bunch of different ways in order to explore it. R, through a combination of its well-designed base graphics package, the exceptionally well-designed lattice graphics package, and the jaw-droppingly well-designed ggplot2 graphics package, offers a breathtaking array of visualization options that you access through the command line or scripts. It has power that you just can’t get using a graphical interface to generate your charts.

When you need to be really clear about how you change your data. Setting aside the cool bells and whistles of R, this particular angle has had the most practical significance for me. A lot of the work that I do as a consultant involves direct marketing data. I set up the data either for analysis or for print/web production, and this means that I have to mutate it quite a bit from its original form. It’s rare that I can take client data in the state they store it and use it directly without manipulation. I say “rare,” but it’s really “never.”

The way I’ve always handled the cleaning of data in Excel is to create a bunch of intermediary formulaic columns alongside my raw data columns. I work and rework these intermediaries until I finally have columns containing data that’s been “cleaned” for whatever purpose I have in mind. Then I copy and Paste Special > Values the clean data to a new sheet. That way, I have an audit of what I did to the data, in case I screwed something up, which never happens [COUGH!].

With R all the data mutation I do is now saved in little text files called scripts. I’ll have my raw data in a CSV or something and create a script that loads the data, mutates it in any way I want, then spits it out into another CSV and/or R object. The advantage of using R in this way rather than Excel is that I can be a lot more concise in terms of code and descriptive in terms of commenting about what I do. Another advantage is that I can use regular expressions in R, which one cannot use in Excel. R is worth learning so you can clean data more elegantly.

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom, for those of you familiar with Lightroom. It’s nice.

When you need serious statistical capabilities. Excel has a bad reputation for statistics. Historically there have been a variety of situations where Excel has demonstrated numerical and programmatic errors that produce flat wrong answers. I haven’t seen a statistician’s review of Excel for version 2007 or later, though, so I can’t really pass judgment about Excel’s current incarnations. Let me know if you have a link to something along those lines.

Excel 2007 was, generally speaking, a big improvement over previous versions. I’m a big fan of the SUMIFS and IFERROR functions, and I really appreciate the “Big Grid” and multiprocessor support. And all indications are that Excel 2010 will be another big improvement—check out the Excel team’s very interesting blog. My impression is that the Microsoft folks are sensitive to historical problems and have taken steps to fix them. If so, this is worth applauding.

But I’ll make a last point that is really a restatement of the point I made about the overdetermination of options you get with a graphical interface. This phenomenon applies not just to graphics functions, but to statistical functions generally. The statistical functions you get in R are much more flexible, numerous, and reliable. By a very long shot. This is in large part because R uses a full-blown scripting language rather than a GUI.

Conclusion

I’m quite sure I haven’t hit everything, but these distinctions between the two programs are where I’d start for someone with familiarity with Excel and no knowledge of R. Become a formula whiz in Excel, and learn R!
Head First Data Analysis

Learn more about this topic from Head First Data Analysis.

How can you learn to manage and analyze all kinds of data? Turn to Head First Data Analysis, where you'll learn how to collect and organize your data, sort the distractions from the truth, find meaningful patterns, draw conclusions, predict the future, and present your findings to others. The unique approach in Head First Data Analysis is by far the most efficient way to learn what you need to know to convert raw data into a vital business tool.

See what you'll learn

Head First Excel

Learn more about this topic from Head First Excel.

Do you use Excel for simple lists, but get confused and frustrated when it comes to actually doing something useful with all that data? Stop tearing your hair out: Head First Excel helps you painlessly move from spreadsheet dabbler to savvy user. Whether you're completely new to Excel, or an experienced user looking to make the program work better for you, this book will help you incorporate Excel into every aspect of your workflow.

See what you'll learn


Tags:
0 Subscribe


1 Reply

0
  rubgb's Photo
Posted Jul 13 2011 04:24 PM

Hi Michael,
regarding Excel bad reputation for statistics, here's a link to an article from the Computational Statistics and Data Analysis journal about the accuracy of statistical procedures in Excel 2007:

http://www.pages.dre...5/excel2007.pdf

Regards,
Ruben