There are many ways to grab external data in R. This excerpt from Joseph Adler's R in a Nutshell will show you just how versatile R can be when it comes to gathering data.
One of the nicest things about R is how easy it is to pull in data from other programs. R can import data from text files, other statistics software, and even spreadsheets. You don’t even need a local copy of the file: you can specify a file at a URL, and R will fetch the file for you over the Internet.
Text Files
Most text files containing data are formatted similarly: each line of a text file represents an observation (or record). Each line contains a set of different variables associated with that observation. Sometimes, different variables are separated by a special character called the delimiter. Other times, variables are differentiated by their location on each line.
Delimited files
R includes a family of functions for importing delimited text files into R, based on the read.table function:
read.table(file, header, sep = , quote = , dec = , row.names, col.names, as.is = , na.strings , colClasses , nrows =, skip = , check.names = , fill = , strip.white = , blank.lines.skip = , comment.char = , allowEscapes = , flush = , stringsAsFactors = , encoding = )
The read.table function reads a text file into R and returns a data.frame object. Each row in the input file is interpreted as an observation. Each column in the input file represents a variable. The read.table function expects each field to be separated by a delimiter.
For example, suppose that you had a file called top.5.salaries.csv that contained the following text (and only this text):
name.last,name.first,team position,salary "Manning","Peyton","Colts","QB",18700000 "Brady","Tom","Patriots","QB",14626720 "Pepper","Julius","Panthers","DE",14137500 "Palmer","Carson","Bengals","QB",13980000 "Manning","Eli","Giants","QB",12916666
Notice how this data is encoded:
The first row contains the column names.
Each text field is encapsulated in quotes.
Each field is separated by commas.
To load this file into R, you would specify that the first row contained column names (header=TRUE), that the delimiter was a comma (sep=","), and that quotes were used to encapsulate text (quote="\""). Here is an R statement that loads in this file:
> top.5.salaries <- read.table("top.5.salaries.csv",
+ header=TRUE,
+ sep=",",
+ quote="\"")The read.table function is very flexible and allows you to load files with many different properties. Here is a brief description of the options for read.table.
The most important options are sep and header. You almost always have to know the field separator and know if there is a header field. R includes a set of convenience functions that call read.table with different default options for these values (and a couple of others). Here is a description of these functions.
| Function | header | sep | quote | dec | fill | comment.char |
|---|---|---|---|---|---|---|
| read.table | FALSE | \” or \’ | . | !blank.lines.skip | # | |
| read.csv | TRUE | , | \” | . | TRUE | |
| read.csv2 | TRUE | ; | \” | , | TRUE | |
| read.delim | TRUE | \t | \” | . | TRUE | |
| read.delim2 | TRUE | \t | \” | , | TRUE |
In most cases, you will find that you can use read.csv for comma-separated files or read.delim for tab-delimited files without specifying any other options. (Except, I suppose, if you are in Europe, and you use commas to indicate the decimal point in numbers. Then you can use read.csv2 and read.delim2.)
As another example, suppose that you wanted to analyze some historical stock quote data. Yahoo! Finance provides this information in an easily downloadable form on its website; you can fetch a CSV file from a single URL. For example, to fetch the closing price of the S&P 500 index for every month between April 1, 1999, and April 1, 2009, you could use the following URL: http://ichart.financ...g=m&ignore=.csv.
Conveniently, you can use a URL in place of a filename in R. This means that you could load this data into R with the following expression:
> sp500 <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",
+ "s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv"))
> # show the first 5 rows
> sp500[1:5,]
Date Open High Low Close Volume Adj.Close
1 2009-04-01 793.59 813.62 783.32 811.08 12068280000 811.08
2 2009-03-02 729.57 832.98 666.79 797.87 7633306300 797.87
3 2009-02-02 823.09 875.01 734.52 735.09 7022036200 735.09
4 2009-01-02 902.99 943.85 804.30 825.88 5844561500 825.88
5 2008-12-01 888.61 918.85 815.69 903.25 5320791300 903.25We will revisit this example in the next section.
If you’re trying to load a really big file, you might find that loading the file takes a long time. It can be very frustrating to wait 15 minutes for a file to load, only to discover that you have specified the wrong separator. A useful technique for testing is to only load a small number of rows into R. For example, to load 20 rows, you would add nrows=20 as an argument to read.table.
Many programs can export data as text files. Here are a few tips for creating text files that you can easily read into R:
For Microsoft Excel spreadsheets, you can export them as either comma-delimited files (CSV files) or tab-delimited files (TXT files). When possible, you should specify Unix-style line delimiters, not MS-DOS line delimiters. (MS-DOS files end each line with “\n\r,” while Unix-style systems end lines with “\n.”) There are two things to think about when choosing between CSV and TXT files.
CSV files can be more convenient because (by default) opening these files in Windows Explorer will open these files in Microsoft Excel. However, if you are using CSV files, then you must be careful to enclose text in quotes if the data contains commas (and, additionally, you must escape any quotation marks within text fields). Tab characters occur less often in text, so tab-delimited files are less likely to cause problems.
If you are exporting data from a database, consider using a GUI tool to query the database and export the results. It is possible to use command-line scripts to export data using tools like sqlplus, pgsql, or mysql, but doing so is often tricky.
Here are a few options that I have tried. If you are using Microsoft Windows, a good choice is Toad for Data Analysts (available from http://www.toadsoft....a/tdaindex.html); this will work with many different databases. If you are exporting from MySQL, MySQL Query Browser is also a good choice; versions are available for Microsoft Windows, Mac OS X, and Linux (you can download it from http://dev.mysql.com...-tools/5.0.html). Oracle now produces a free multi-platform query tool called SQL Developer. (You can find it at http://www.oracle.co...oper/index.html.)
Fixed-width files
To read a fixed-width format text file into a data frame, you can use the read.fwf function:
read.fwf(file, widths, header = , sep = , skip = , row.names, col.names, n = , buffersize = , ...)
Here is a description of the arguments to read.fwf.
Note that read.fwf can also take many arguments used by read.table, including as.is, na.strings, colClasses, and strip.white.
| Using Other Languages to Preprocess Text Files |
R is a very good system for numerical calculations and data visualization, but it’s not the most efficient choice for processing large text files. For example, the U.S. Centers for Disease Control and Prevention publishes data files containing information on every death in the United States (see http://www.cdc.gov/n...statsonline.htm). These data files are provided in a fixed-width format. They are very large; the data file for 2006 was 1.1 GB uncompressed. In theory, you could load a subset of data from this file into R using a statement like this: > # data from ftp://ftp.cdc.gov/pu...lth_Statistics/ NCHS/Datasets/DVS/mortality/mort2006us.zip > mort06 <- read.fwf(file="MORT06.DUSMCPUB", + widths= c(19,1,40,2,1,1,2,2,1,4,1,2,2,2,2,1,1,1,16,4,1,1,1,1, + 34,1,1,4,3,1,3,3,2,283,2,1,1,1,1,33,3,1,1), + col.names= c("X0","ResidentStatus","X1","Education1989", + "Education2003","EducationFlag","MonthOfDeath", + "X5","Sex","AgeDetail","AgeSubstitution", + "AgeRecode52","AgeRecode27","AgeRecode12", + "AgeRecodeInfant22","PlaceOfDeath","MaritalStatus", + "DayOfWeekofDeath","X15","CurrentDataYear", + "InjuryAtWork","MannerOfDeath","MethodOfDisposition", + "Autopsy","X20","ActivityCode","PlaceOfInjury", + "ICDCode","CauseRecode358","X24","CauseRecode113", + "CauseRecode130","CauseRecord39","X27","Race", + "BridgeRaceFlag","RaceImputationFlag","RaceRecode3", + "RaceRecord5","X32","HispanicOrigin","X33", + "HispanicOriginRecode","X34") + ) Unfortunately, this probably won’t work very well. First, R processes files less quickly than some other languages. Second, R will try to load the entire table into memory. The file takes up 1.1 GB as a raw text file. Many fields in this file are used to encode categorical values that have a small number of choices (such as race) but show the value as numbers. R will convert these character values from single characters (which take up 1 byte) to integers (which take up 4 bytes). This means that it will take a lot of memory to load this file into your computer. As an alternative, I’d suggest using a scripting language like Perl, Python, or Ruby to preprocess large, complex text files and turn them into a digestible form. (As a side note, I usually write out lists of field names and lengths in Excel and then use Excel formulas to create the R or Perl code to load them. That’s how I generated all of the code shown in this example.) Here’s the Perl script that I used to preprocess the raw mortality data file, filtering out fields I didn’t need and writing the results to a CSV file: #!/usr/bin/perl
# file to preprocess (and filter) mortality data
print "ResidentStatus,Education1989,Education2003,EducationFlag," .
"MonthOfDeath,Sex,AgeDetail,AgeSubstitution,AgeRecode52," .
"AgeRecode27,AgeRecode12,AgeRecodeInfant22,PlaceOfDeath," .
"MaritalStatus,DayOfWeekofDeath,CurrentDataYear,InjuryAtWork," .
"MannerOfDeath,MethodOfDisposition,Autopsy,ActivityCode," .
"PlaceOfInjury,ICDCode,CauseRecode358,CauseRecode113," .
"CauseRecode130,CauseRecord39,Race,BridgeRaceFlag," .
"RaceImputationFlag,RaceRecode3,RaceRecord5,HispanicOrigin," .
"HispanicOriginRecode\n";
while(<>) {
my ($X0,$ResidentStatus,$X1,$Education1989,$Education2003,
$EducationFlag,$MonthOfDeath,$X5,$Sex,$AgeDetail,
$AgeSubstitution,$AgeRecode52,$AgeRecode27,$AgeRecode12,
$AgeRecodeInfant22,$PlaceOfDeath,$MaritalStatus,
$DayOfWeekofDeath,$X15,$CurrentDataYear,$InjuryAtWork,
$MannerOfDeath,$MethodOfDisposition,$Autopsy,$X20,$ActivityCode,
$PlaceOfInjury,$ICDCode,$CauseRecode358,$X24,$CauseRecode113,
$CauseRecode130,$CauseRecord39,$X27,$Race,$BridgeRaceFlag,
$RaceImputationFlag,$RaceRecode3,$RaceRecord5,$X32,
$HispanicOrigin,$X33,$HispanicOriginRecode,$X34)
= unpack("a19a1a40a2a1a1a2a2a1a4a1a2a2a2a2a1a1a1a16a4a1" .
"a1a1a1a34a1a1a4a3a1a3a3a2a283a2a1a1a1a1a33a3a1a1",
$_);
print "$ResidentStatus,$Education1989,$Education2003,".
"$EducationFlag,$MonthOfDeath,$Sex,$AgeDetail,".
"$AgeSubstitution,$AgeRecode52,$AgeRecode27,".
"$AgeRecode12,$AgeRecodeInfant22,$PlaceOfDeath," .
"$MaritalStatus,$DayOfWeekofDeath,$CurrentDataYear,".
"$InjuryAtWork,$MannerOfDeath,$MethodOfDisposition,".
"$Autopsy,$ActivityCode,$PlaceOfInjury,$ICDCode,".
"$CauseRecode358,$CauseRecode113,$CauseRecode130,".
"$CauseRecord39,$Race,$BridgeRaceFlag,$RaceImputationFlag,".
"$RaceRecode3,$RaceRecord5,$HispanicOrigin," .
"$HispanicOriginRecode\n";
}
I executed this script with the following command in a bash shell: ./mortalities.pl < MORT06.DUSMCPUB > MORT06.csv You can now load the data into R with a line like this: > mort06 <- read.csv(file="~/Documents/book/data/MORT06.csv") We’ll come back to this data set in the chapters on statistical tests and statistical models. |
Other functions to parse data
Most of the time, you should be able to load text files into R with the read.table function. Sometimes, however, you might be provided with a file that cannot be read correctly with this function. For example, observations in the file might span multiple lines. To read data into R one line at a time, use the function readLines:
readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown")
The readLines function will return a character vector, with one value corresponding to each row in the file. Here is a description of the arguments to readLines.
Note that you can use readLines interactively to enter data.
Another useful function for reading more complex file formats is scan:
scan(file = "", what = double(0), nmax = -1, n = -1, sep = "", quote = if(identical(sep, "\n")) "" else "'\"", dec = ".", skip = 0, nlines = 0, na.strings = "NA", flush = FALSE, fill = FALSE, strip.white = FALSE, quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE, comment.char = "", allowEscapes = FALSE, encoding = "unknown")
The scan function allows you to read the contents of a file into R. Unlike readLines, scan allows you to read data into a specifically defined data structure using the argument what.
Here is a description of the arguments to scan.
Like readLines, you can also use scan to enter data directly into R.
Other Software
Although many software packages can export data as text files, you might find it more convenient to read their data files directly. R can read files in many other formats. Table 12-1 shows a list of functions for reading (and writing) files in other formats. You can find more information about these functions in the help files.
| File format | Reading | Writing |
|---|---|---|
| ARFF | read.arff | write.arff |
| DBF | read.dbf | write.dbf |
| Stata | read.dta | write.dta |
| Epi Info | read.epiinfo | |
| Minitab | read.mtp | |
| Octave | read.octave | |
| S3 binary files, data.dump files | read.S | |
| SPSS | read.spss | |
| SAS Permanent Dataset | read.ssd | |
| Systat | read.sysstat | |
| SAS XPORT File | read.xport |
R is rapidly becoming the standard for developing statistical software, and R in a Nutshell provides a quick and practical way to learn this increasingly popular open source language and environment. You'll not only learn how to program in R, but also how to find the right user-contributed R packages for statistical modeling, visualization, and bioinformatics.




Help









