How to Compute Basic Statistics with R

Posted Apr 05 2011 01:56 AM

Sometimes, when using R, you want to calculate basic statistics: mean, median, standard deviation, variance, correlation, or covariance. The following excerpt from the O'Reilly publication 25 Recipes for Getting Started with R will help.

Use one of these functions as appropriate, assuming that x and y are vectors:

  • mean(x)

  • median(x)

  • sd(x)

  • var(x)

  • cor(x, y)

  • cov(x, y)


When I first opened the documentation for R, I began searching for material called something like “Procedures for Calculating Standard Deviation.” I figured that such an important topic would likely require a whole chapter.

It’s not that complicated.

Standard deviation and other basic statistics are calculated by simple functions. Ordinarily, the function argument is a vector of numbers, and the function returns the calculated statistic:

> x <- c(0,1,1,2,3,5,8,13,21,34)
> mean(x)
[1] 8.8
> median(x)
[1] 4
> sd(x)
[1] 11.03328
> var(x)
[1] 121.7333



The sd function calculates the sample standard deviation, and var calculates the sample variance.
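As a rough check of what "sample" means here, the following sketch recomputes both statistics by hand using the n - 1 denominator. The names n, manual_var, and manual_sd are illustrative only and are not part of the recipe.

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Sample standard deviation: square root of the sample variance
manual_sd <- sqrt(manual_var)

# Both comparisons should agree with the built-in functions
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```

If you need the population versions instead, multiplying var(x) by (n - 1) / n is one common adjustment.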

The cor and cov functions can calculate the correlation and covariance, respectively, between two vectors:

> x <- c(0,1,1,2,3,5,8,13,21,34)
> y <- log(x+1)
> cor(x,y)
[1] 0.9068053
> cov(x,y)
[1] 11.49988
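The two statistics are closely related: correlation is covariance rescaled by the two standard deviations. A minimal sketch of that relationship, where manual_cor is a name made up for illustration:

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)

# Correlation is the covariance divided by the product of the standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))

# Should match the value returned by cor
all.equal(manual_cor, cor(x, y))
```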



All these functions are picky about values that are not available (NA). Even one NA value in the vector argument causes any of these functions to return NA, or even halt altogether with a cryptic error:

> x <- c(0,1,1,2,3,NA)
> mean(x)
[1] NA
> sd(x)
[1] NA



It’s annoying when R is that cautious, but it is the right thing to do. You must think carefully about your situation. Does an NA in your data invalidate the statistic? If yes, then R is doing the right thing. If not, you can override this behavior for mean, median, sd, and var by setting na.rm=TRUE, which tells R to ignore the NA values (cor and cov do not accept na.rm; they take a use argument instead, such as use="complete.obs"):

> x <- c(0,1,1,2,3,NA)
> mean(x, na.rm=TRUE)
[1] 1.4
> sd(x, na.rm=TRUE)
[1] 1.140175
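For cor and cov, missing values are handled through the use argument rather than na.rm. The sketch below assumes a second vector y that is made up purely for illustration; it counts the missing values first and then computes the correlation from the complete pairs only.

```r
x <- c(0, 1, 1, 2, 3, NA)
y <- c(2, 4, 3, 6, 8, 10)   # hypothetical companion data, not from the recipe

# How many values are missing?
n_missing <- sum(is.na(x))

# With the default settings, a single NA makes the result NA
default_cor <- cor(x, y)

# use = "complete.obs" drops every pair in which either value is missing
complete_cor <- cor(x, y, use = "complete.obs")
```

Dropping pairs this way is only appropriate if the missing values do not invalidate the statistic, which is the same judgment call described above.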



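A beautiful aspect of mean and sd, in the version of R current when this recipe was written, is that they are smart about data frames. They understand that each column of the data frame is a different variable, so they calculate their statistic for each column individually. (Current versions of R no longer support this: mean(dframe) returns NA with a warning and sd(dframe) stops with an error, so use colMeans(dframe) and sapply(dframe, sd) to get the same per-column results.) This example calculates those basic statistics for a data frame with three columns: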

> print(dframe)
       small    medium       big
1  0.6739635 10.526448  99.83624
2  1.5524619  9.205156 100.70852
3  0.3250562 11.427756  99.73202
4  1.2143595  8.533180  98.53608
5  1.3107692  9.763317 100.74444
6  2.1739663  9.806662  98.58961
7  1.6187899  9.150245 100.46707
8  0.8872657 10.058465  99.88068
9  1.9170283  9.182330 100.46724
10 0.7767406  7.949692 100.49814
> mean(dframe)
    small    medium       big 
 1.245040  9.560325 99.946003 
> sd(dframe)
    small    medium       big 
0.5844025 0.9920281 0.8135498



Notice that mean and sd both return three values, one for each column defined by the data frame. (Technically, they return a three-element vector in which the names attribute is taken from the columns of the data frame.)
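The per-column calculation can also be sketched with colMeans and sapply, both of which accept data frames in current R releases. The data frame below is randomly generated for illustration and does not reproduce the values shown above; col_means and col_sds are hypothetical names.

```r
set.seed(42)  # arbitrary seed so the made-up data is reproducible

# Hypothetical data frame with three numeric columns
dframe <- data.frame(
  small  = rnorm(10, mean = 1,   sd = 0.5),
  medium = rnorm(10, mean = 10,  sd = 1),
  big    = rnorm(10, mean = 100, sd = 1)
)

# Named vector of column means
col_means <- colMeans(dframe)

# Named vector of column standard deviations, one call to sd per column
col_sds <- sapply(dframe, sd)

names(col_means)
```

Both results are three-element vectors whose names come from the columns of the data frame.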

The var function understands data frames, too, but it behaves quite differently from mean and sd. It calculates the covariance between the columns of the data frame and returns the covariance matrix:

> var(dframe)
             small      medium         big
small   0.34152627 -0.21516416 -0.04005275
medium -0.21516416  0.98411974 -0.09253855
big    -0.04005275 -0.09253855  0.66186326



Likewise, if x is either a data frame or a matrix, cor(x) returns the correlation matrix and cov(x) returns the covariance matrix:

> cor(dframe)
             small     medium         big
small   1.00000000 -0.3711367 -0.08424345
medium -0.37113670  1.0000000 -0.11466070
big    -0.08424345 -0.1146607  1.00000000
> cov(dframe)
             small      medium         big
small   0.34152627 -0.21516416 -0.04005275
medium -0.21516416  0.98411974 -0.09253855
big    -0.04005275 -0.09253855  0.66186326
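A short sketch of how these matrices relate to one another, using a small made-up data frame (df and its columns a, b, and c are hypothetical). It checks that var and cov give the same matrix for a data frame, and that rescaling the covariance matrix with cov2cor gives the correlation matrix.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6),
  c = c(5, 3, 4, 1, 2)
)

cov_matrix <- cov(df)
cor_matrix <- cor(df)

# var on a data frame returns the same covariance matrix as cov
same_as_var <- all.equal(var(df), cov_matrix)

# cov2cor rescales a covariance matrix into a correlation matrix
same_as_cor <- all.equal(cov2cor(cov_matrix), cor_matrix)
```

The diagonal of the covariance matrix holds the variance of each column, and the diagonal of the correlation matrix is always 1.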



Alas, the median function does not understand data frames. To calculate the medians of data frame columns, use the lapply function to apply the median function to each column separately.
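A minimal sketch of that approach, using a small made-up data frame (the values are hypothetical). lapply returns a list with one element per column; sapply is a closely related function that simplifies the result to a named vector when it can.

```r
# Hypothetical data frame, for illustration only
dframe <- data.frame(
  small  = c(1, 3, 2),
  medium = c(10, 12, 11),
  big    = c(100, 98, 105)
)

# One median per column, returned as a list
med_list <- lapply(dframe, median)

# Same calculation, simplified to a named numeric vector
med_vec <- sapply(dframe, median)
```

Which form is more convenient depends on what you do next: the list keeps each result separate, while the vector prints compactly.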

Learn more about this topic from 25 Recipes for Getting Started with R. 

This short, concise book provides beginners with a selection of how-to recipes to solve simple problems with R. Each solution gives you just what you need to know to get started with R for basic statistics, graphics, and regression. These solutions were selected from O'Reilly's R Cookbook, which contains more than 200 recipes for R that you'll find useful once you move beyond the basics.
