Use one of these functions as appropriate, assuming that x and y are vectors:
mean(x)
median(x)
sd(x)
var(x)
cor(x, y)
cov(x, y)
When I first opened the documentation for R, I began searching for material called something like “Procedures for Calculating Standard Deviation.” I figured that such an important topic would likely require a whole chapter.
It’s not that complicated.
Standard deviation and other basic statistics are calculated by simple functions. Ordinarily, the function argument is a vector of numbers, and the function returns the calculated statistic:
> x <- c(0,1,1,2,3,5,8,13,21,34)
> mean(x)
[1] 8.8
> median(x)
[1] 4
> sd(x)
[1] 11.03328
> var(x)
[1] 121.7333
The sd function calculates the sample standard deviation, and var calculates the sample variance.
The cor and cov functions can calculate the correlation and covariance, respectively, between two vectors:
> x <- c(0,1,1,2,3,5,8,13,21,34)
> y <- log(x+1)
> cor(x,y)
[1] 0.9068053
> cov(x,y)
[1] 11.49988
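To make "sample" concrete, here is a minimal sketch that recomputes the same statistics by hand, assuming the usual n - 1 denominator for the sample variance. The names manual_var, manual_sd, and manual_cor are illustrative only and not part of R.

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)
n <- length(x)

# Sample variance by hand: squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Sample standard deviation is the square root of the sample variance.
manual_sd <- sqrt(manual_var)

# Correlation is the covariance rescaled by both standard deviations.
manual_cor <- cov(x, y) / (sd(x) * sd(y))

# Each comparison should agree with the built-in function up to rounding.
print(all.equal(manual_var, var(x)))
print(all.equal(manual_sd, sd(x)))
print(all.equal(manual_cor, cor(x, y)))
```

If you need the population versions instead, multiply the variance by (n - 1) / n before taking the square root.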
All these functions are picky about values that are not available (NA). Even one NA value in the vector argument causes any of these functions to return NA, or even halt altogether with a cryptic error:
> x <- c(0,1,1,2,3,NA)
> mean(x)
[1] NA
> sd(x)
[1] NA
It’s annoying when R is that cautious, but it is the right thing to do. You must think carefully about your situation. Does an NA in your data invalidate the statistic? If yes, then R is doing the right thing. If not, you can override this behavior by setting na.rm=TRUE, which tells R to ignore the NA values:
> x <- c(0,1,1,2,3,NA)
> mean(x, na.rm=TRUE)
[1] 1.4
> sd(x, na.rm=TRUE)
[1] 1.140175
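The same idea extends to the other functions in this recipe, with one wrinkle: cor and cov take a use argument rather than na.rm. The sketch below is a rough illustration only; the vector y and its values are invented for the example.

```r
x <- c(0, 1, 1, 2, 3, NA)
y <- c(2, 4, NA, 8, 10, 12)  # hypothetical second vector, made up for illustration

# Count the missing values before deciding how to treat them.
n_missing <- sum(is.na(x))

# median and var accept na.rm just as mean and sd do.
med_x <- median(x, na.rm = TRUE)
var_x <- var(x, na.rm = TRUE)

# cor and cov have no na.rm argument; use = "complete.obs" keeps
# only the pairs in which both values are present.
r_xy <- cor(x, y, use = "complete.obs")
c_xy <- cov(x, y, use = "complete.obs")

print(n_missing)
print(med_x)
print(r_xy)
```

Counting the NA values first is a cheap way to check how much data the statistic is silently dropping.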
A beautiful aspect of mean and sd is that they are smart about data frames. They understand that each column of the data frame is a different variable, so they calculate their statistic for each column individually. This example calculates those basic statistics for a data frame with three columns:
> print(dframe)
small medium big
1 0.6739635 10.526448 99.83624
2 1.5524619 9.205156 100.70852
3 0.3250562 11.427756 99.73202
4 1.2143595 8.533180 98.53608
5 1.3107692 9.763317 100.74444
6 2.1739663 9.806662 98.58961
7 1.6187899 9.150245 100.46707
8 0.8872657 10.058465 99.88068
9 1.9170283 9.182330 100.46724
10 0.7767406 7.949692 100.49814
> mean(dframe)
small medium big
1.245040 9.560325 99.946003
> sd(dframe)
small medium big
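0.5844025 0.9920281 0.8135498
Notice that mean and sd both return three values, one for each column defined by the data frame. (Technically, they return a three-element vector in which the names attribute is taken from the columns of the data frame.) Note that this reflects older releases of R: calling mean or sd directly on a data frame was later deprecated and then removed, so in current versions use sapply(dframe, mean) and sapply(dframe, sd) to get the same per-column results.
The var function understands data frames, too, but it behaves quite differently from mean and sd. It calculates the covariance between the columns of the data frame and returns the covariance matrix:
> var(dframe)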
small medium big
small 0.34152627 -0.21516416 -0.04005275
medium -0.21516416 0.98411974 -0.09253855
big -0.04005275 -0.09253855 0.66186326
Likewise, if x is either a data frame or a matrix, cor(x) returns the correlation matrix and cov(x) returns the covariance matrix:
> cor(dframe)
small medium big
small 1.00000000 -0.3711367 -0.08424345
medium -0.37113670 1.0000000 -0.11466070
big -0.08424345 -0.1146607 1.00000000
> cov(dframe)
small medium big
small 0.34152627 -0.21516416 -0.04005275
medium -0.21516416 0.98411974 -0.09253855
big -0.04005275 -0.09253855 0.66186326
Alas, the median function does not understand data frames. To calculate the medians of data frame columns, use the lapply function to apply the median function to each column separately.
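A minimal sketch of the column-by-column approach follows. The small data frame here is a made-up stand-in for dframe, not the data shown in this recipe. The same pattern also works for mean and sd, which is useful if your version of R does not accept a data frame in those functions directly (worth checking against your installation).

```r
# Hypothetical three-column data frame; the values are invented for illustration.
dframe <- data.frame(
  small  = c(1, 2, 4),
  medium = c(10, 9, 12),
  big    = c(100, 99, 103)
)

# lapply returns a list with one median per column.
med_list <- lapply(dframe, median)

# sapply does the same work but simplifies the result to a named numeric vector.
med_vec <- sapply(dframe, median)

# The same pattern covers mean and sd, column by column.
col_means <- sapply(dframe, mean)
col_sds <- sapply(dframe, sd)

print(med_vec)
print(col_means)
```

Choosing between lapply and sapply is mostly a matter of the shape you want back: a list is safer when columns might yield results of different lengths, while a named vector is easier to read and print.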



