Jump to content

How to Import Data from External Files in R

+ 1
  adfm's Photo
Posted Jun 16 2010 03:48 PM

There are many ways to grab external data in R. This excerpt from Joseph Adler's R in a Nutshell will show you just how versatile R can be when it comes to gathering data.


One of the nicest things about R is how easy it is to pull in data from other programs. R can import data from text files, other statistics software, and even spreadsheets. You don’t even need a local copy of the file: you can specify a file at a URL, and R will fetch the file for you over the Internet.

Text Files

Most text files containing data are formatted similarly: each line of a text file represents an observation (or record). Each line contains a set of different variables associated with that observation. Sometimes, different variables are separated by a special character called the delimiter. Other times, variables are differentiated by their location on each line.

Delimited files

R includes a family of functions for importing delimited text files into R, based on the read.table function:

read.table(file, header, sep = , quote = , dec = , row.names, col.names,

 	as.is = , na.strings , colClasses , nrows =, skip = ,

 	check.names = , fill = , strip.white = , blank.lines.skip = ,

 	comment.char = , allowEscapes = , flush = , stringsAsFactors = ,

 	encoding = )

The read.table function reads a text file into R and returns a data.frame object. Each row in the input file is interpreted as an observation. Each column in the input file represents a variable. The read.table function expects each field to be separated by a delimiter.

For example, suppose that you had a file called top.5.salaries.csv that contained the following text (and only this text):

name.last,name.first,team position,salary

"Manning","Peyton","Colts","QB",18700000

"Brady","Tom","Patriots","QB",14626720

"Pepper","Julius","Panthers","DE",14137500

"Palmer","Carson","Bengals","QB",13980000

"Manning","Eli","Giants","QB",12916666

Notice how this data is encoded:

  • The first row contains the column names.

  • Each text field is encapsulated in quotes.

  • Each field is separated by commas.

To load this file into R, you would specify that the first row contained column names (header=TRUE), that the delimiter was a comma (sep=","), and that quotes were used to encapsulate text (quote="\""). Here is an R statement that loads in this file:

> top.5.salaries <- read.table("top.5.salaries.csv",

+ header=TRUE,

+ 	sep=",",

+ quote="\"")

The read.table function is very flexible and allows you to load files with many different properties. Here is a brief description of the options for read.table.

ArgumentDescriptionDefault
fileThe name of the file to open or, alternatively, the name of a connection containing the data. You can even use a URL. (This is the one required argument for read.table.)None
headerA logical value indicating whether the first row of the file contains variable names.FALSE
sepThe character (or characters) separating fields. When “” is specified, any whitespace is used as a separator.“”
quoteIf character values are enclosed in quotes, this argument should specify the type of quotes.“”
decThe character used for decimal points..
row.namesA character vector containing row names for the returned data frame.None
col.namesA character vector containing column names for the returned data frame.None
as.isA logical vector (the same length as the number of columns) that specifies whether or not to convert character values to factors.!stringsAsFactors
na.stringsA character vector specifying values that should be interpreted as NA.NA
colClassesA character vector of class names to be assigned to each column.NA
nrowsAn integer value specifying the number of rows to read. (Invalid values, such as negatives, are ignored.)-1
skipAn integer value specifying the number of rows in the text file to skip before beginning to read data.0
check.namesA logical value that specifies whether read.table should check if the column names are valid symbol names in R.TRUE
fillSometimes, a file might contain rows of unequal length. This argument is a logical value that specifies whether read.table should implicitly add blank fields at the end of rows where some values were missing.!blank.lines.skip
strip.whiteWhen sep !="", this logical value specifies whether read.table should remove extra leading and trailing white space from character fields.FALSE
blank.lines.skipA logical value that specifies whether read.table should ignore blank lines.TRUE
comment.charread.table can ignore comment lines in input files if the comment lines begin with a single special character. This argument specifies the character used to delineate these lines.#
allowEscapesA logical value that indicates whether escapes (such as “\n” for a new line) should be interpreted or if character strings should be read literally.FALSE
flushA logical value that indicates whether read.table should skip to the next line when all requested fields have been read in from a line.FALSE
stringsAsFactorsA logical value indicating whether text fields should be converted to factors.default.stringsAsFactors()
encodingThe encoding scheme used for the source file.“unknown”

The most important options are sep and header. You almost always have to know the field separator and know if there is a header field. R includes a set of convenience functions that call read.table with different default options for these values (and a couple of others). Here is a description of these functions.

Functionheadersepquotedecfillcomment.char
read.tableFALSE \” or \’.!blank.lines.skip#
read.csvTRUE,\”.TRUE 
read.csv2TRUE;\”,TRUE 
read.delimTRUE\t\”.TRUE 
read.delim2TRUE\t\”,TRUE 

In most cases, you will find that you can use read.csv for comma-separated files or read.delim for tab-delimited files without specifying any other options. (Except, I suppose, if you are in Europe, and you use commas to indicate the decimal point in numbers. Then you can use read.csv2 and read.delim2.)

As another example, suppose that you wanted to analyze some historical stock quote data. Yahoo! Finance provides this information in an easily downloadable form on its website; you can fetch a CSV file from a single URL. For example, to fetch the closing price of the S&P 500 index for every month between April 1, 1999, and April 1, 2009, you could use the following URL: http://ichart.financ...g=m&ignore=.csv.

Conveniently, you can use a URL in place of a filename in R. This means that you could load this data into R with the following expression:

> sp500 <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",

+ "s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv"))

> # show the first 5 rows



> sp500[1:5,]

 	Date Open High	Low Close 	Volume Adj.Close

1 2009-04-01 793.59 813.62 783.32 811.08 12068280000	811.08

2 2009-03-02 729.57 832.98 666.79 797.87 7633306300	797.87

3 2009-02-02 823.09 875.01 734.52 735.09 7022036200	735.09

4 2009-01-02 902.99 943.85 804.30 825.88 5844561500	825.88

5 2008-12-01 888.61 918.85 815.69 903.25 5320791300	903.25

We will revisit this example in the next section.

If you’re trying to load a really big file, you might find that loading the file takes a long time. It can be very frustrating to wait 15 minutes for a file to load, only to discover that you have specified the wrong separator. A useful technique for testing is to only load a small number of rows into R. For example, to load 20 rows, you would add nrows=20 as an argument to read.table.

Many programs can export data as text files. Here are a few tips for creating text files that you can easily read into R:

  • For Microsoft Excel spreadsheets, you can export them as either comma-delimited files (CSV files) or tab-delimited files (TXT files). When possible, you should specify Unix-style line delimiters, not MS-DOS line delimiters. (MS-DOS files end each line with “\n\r,” while Unix-style systems end lines with “\n.”) There are two things to think about when choosing between CSV and TXT files.

    CSV files can be more convenient because (by default) opening these files in Windows Explorer will open these files in Microsoft Excel. However, if you are using CSV files, then you must be careful to enclose text in quotes if the data contains commas (and, additionally, you must escape any quotation marks within text fields). Tab characters occur less often in text, so tab-delimited files are less likely to cause problems.

  • If you are exporting data from a database, consider using a GUI tool to query the database and export the results. It is possible to use command-line scripts to export data using tools like sqlplus, pgsql, or mysql, but doing so is often tricky.

    Here are a few options that I have tried. If you are using Microsoft Windows, a good choice is Toad for Data Analysts (available from http://www.toadsoft....a/tdaindex.html); this will work with many different databases. If you are exporting from MySQL, MySQL Query Browser is also a good choice; versions are available for Microsoft Windows, Mac OS X, and Linux (you can download it from http://dev.mysql.com...-tools/5.0.html). Oracle now produces a free multi-platform query tool called SQL Developer. (You can find it at http://www.oracle.co...oper/index.html.)

Fixed-width files

To read a fixed-width format text file into a data frame, you can use the read.fwf function:

read.fwf(file, widths, header = , sep = ,

 	skip = , row.names, col.names, n = ,

 	buffersize = , ...)

Here is a description of the arguments to read.fwf.

ArgumentDescriptionDefault
fileThe name of the file to open or, alternatively, the name of a connection containing the data. (This is a required argument.) 
widthsAn integer vector or a list of integer vectors. If the input file has one record per line, then use an integer vector where each value represents the width of each variable. If each record spans multiple lines, then use a list of integer vectors where each integer vector corresponds to the widths of the variables on that line. (This is a required argument.) 
headerA logical value indicating whether the first line of the file contains variable names. (If it does, the names must be delimited by sep.)FALSE
sepThe character used to delimit variable names in the header.\t
skipAn integer specifying the number of lines to skip at the beginning of the file.A0
row.namesA character vector used to specify row names in the data frame. 
col.namesA character vector used to specify column names in the data frame. 
nAn integer value specifying the number of rows to records to read into R. (Invalid values, such as negatives, are ignored.)-1
buffersizeAn integer specifying the maximum number of lines to be read at one time. (This value may be tuned to optimize performance.)2,000

Note that read.fwf can also take many arguments used by read.table, including as.is, na.strings, colClasses, and strip.white.

Using Other Languages to Preprocess Text Files

R is a very good system for numerical calculations and data visualization, but it’s not the most efficient choice for processing large text files. For example, the U.S. Centers for Disease Control and Prevention publishes data files containing information on every death in the United States (see http://www.cdc.gov/n...statsonline.htm). These data files are provided in a fixed-width format. They are very large; the data file for 2006 was 1.1 GB uncompressed. In theory, you could load a subset of data from this file into R using a statement like this:

> # data from ftp://ftp.cdc.gov/pu...lth_Statistics/

NCHS/Datasets/DVS/mortality/mort2006us.zip

> mort06 <- read.fwf(file="MORT06.DUSMCPUB",

+ widths= c(19,1,40,2,1,1,2,2,1,4,1,2,2,2,2,1,1,1,16,4,1,1,1,1,

+ 34,1,1,4,3,1,3,3,2,283,2,1,1,1,1,33,3,1,1),

+ col.names= c("X0","ResidentStatus","X1","Education1989",

+ "Education2003","EducationFlag","MonthOfDeath",

+ "X5","Sex","AgeDetail","AgeSubstitution",

+ "AgeRecode52","AgeRecode27","AgeRecode12",

+ "AgeRecodeInfant22","PlaceOfDeath","MaritalStatus",

+ "DayOfWeekofDeath","X15","CurrentDataYear",

+ "InjuryAtWork","MannerOfDeath","MethodOfDisposition",

+ "Autopsy","X20","ActivityCode","PlaceOfInjury",

+ "ICDCode","CauseRecode358","X24","CauseRecode113",

+ "CauseRecode130","CauseRecord39","X27","Race",

+ "BridgeRaceFlag","RaceImputationFlag","RaceRecode3",

+ "RaceRecord5","X32","HispanicOrigin","X33",

+ "HispanicOriginRecode","X34")

+ )

Unfortunately, this probably won’t work very well. First, R processes files less quickly than some other languages. Second, R will try to load the entire table into memory. The file takes up 1.1 GB as a raw text file. Many fields in this file are used to encode categorical values that have a small number of choices (such as race) but show the value as numbers. R will convert these character values from single characters (which take up 1 byte) to integers (which take up 4 bytes). This means that it will take a lot of memory to load this file into your computer.

As an alternative, I’d suggest using a scripting language like Perl, Python, or Ruby to preprocess large, complex text files and turn them into a digestible form. (As a side note, I usually write out lists of field names and lengths in Excel and then use Excel formulas to create the R or Perl code to load them. That’s how I generated all of the code shown in this example.) Here’s the Perl script that I used to preprocess the raw mortality data file, filtering out fields I didn’t need and writing the results to a CSV file:

#!/usr/bin/perl



# file to preprocess (and filter) mortality data



print "ResidentStatus,Education1989,Education2003,EducationFlag," .

 "MonthOfDeath,Sex,AgeDetail,AgeSubstitution,AgeRecode52," .

 "AgeRecode27,AgeRecode12,AgeRecodeInfant22,PlaceOfDeath," .

 "MaritalStatus,DayOfWeekofDeath,CurrentDataYear,InjuryAtWork," .

 "MannerOfDeath,MethodOfDisposition,Autopsy,ActivityCode," .

 "PlaceOfInjury,ICDCode,CauseRecode358,CauseRecode113," .

 "CauseRecode130,CauseRecord39,Race,BridgeRaceFlag," .

 "RaceImputationFlag,RaceRecode3,RaceRecord5,HispanicOrigin," .

 "HispanicOriginRecode\n";



while(<>) {

	my ($X0,$ResidentStatus,$X1,$Education1989,$Education2003,

 $EducationFlag,$MonthOfDeath,$X5,$Sex,$AgeDetail,

 $AgeSubstitution,$AgeRecode52,$AgeRecode27,$AgeRecode12,

 $AgeRecodeInfant22,$PlaceOfDeath,$MaritalStatus,

 $DayOfWeekofDeath,$X15,$CurrentDataYear,$InjuryAtWork,

 $MannerOfDeath,$MethodOfDisposition,$Autopsy,$X20,$ActivityCode,

 $PlaceOfInjury,$ICDCode,$CauseRecode358,$X24,$CauseRecode113,

 $CauseRecode130,$CauseRecord39,$X27,$Race,$BridgeRaceFlag,

 $RaceImputationFlag,$RaceRecode3,$RaceRecord5,$X32,

 $HispanicOrigin,$X33,$HispanicOriginRecode,$X34)

 	= unpack("a19a1a40a2a1a1a2a2a1a4a1a2a2a2a2a1a1a1a16a4a1" .

 "a1a1a1a34a1a1a4a3a1a3a3a2a283a2a1a1a1a1a33a3a1a1",

 	$_);

 print "$ResidentStatus,$Education1989,$Education2003,". 

 "$EducationFlag,$MonthOfDeath,$Sex,$AgeDetail,".

 "$AgeSubstitution,$AgeRecode52,$AgeRecode27,".

 "$AgeRecode12,$AgeRecodeInfant22,$PlaceOfDeath," .

 "$MaritalStatus,$DayOfWeekofDeath,$CurrentDataYear,".

 "$InjuryAtWork,$MannerOfDeath,$MethodOfDisposition,".

 "$Autopsy,$ActivityCode,$PlaceOfInjury,$ICDCode,".

 "$CauseRecode358,$CauseRecode113,$CauseRecode130,".

 "$CauseRecord39,$Race,$BridgeRaceFlag,$RaceImputationFlag,".

 "$RaceRecode3,$RaceRecord5,$HispanicOrigin," .

 "$HispanicOriginRecode\n";

}



 	 

I executed this script with the following command in a bash shell:

./mortalities.pl < MORT06.DUSMCPUB > MORT06.csv

You can now load the data into R with a line like this:

> mort06 <- read.csv(file="~/Documents/book/data/MORT06.csv")

We’ll come back to this data set in the chapters on statistical tests and statistical models.


Other functions to parse data

Most of the time, you should be able to load text files into R with the read.table function. Sometimes, however, you might be provided with a file that cannot be read correctly with this function. For example, observations in the file might span multiple lines. To read data into R one line at a time, use the function readLines:

readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE,

 	encoding = "unknown")

The readLines function will return a character vector, with one value corresponding to each row in the file. Here is a description of the arguments to readLines.

ArgumentDescriptionDefault
conA character string (specifying a file or URL) or a connection containing the data to read.stdin()
nAn integer value specifying the number of lines to read. (Negative values mean “read until the end of the file.”)-1L
okA logical value specifying whether to trigger an error if the number of lines in the file is less than n.TRUE
warnA logical value specifying whether to warn the user if the file does not end with an EOL.TRUE
encodingA character value specifying the encoding of the input file.“unknown”

Note that you can use readLines interactively to enter data.

Another useful function for reading more complex file formats is scan:

scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",

 	quote = if(identical(sep, "\n")) "" else "'\"", dec = ".",

 	skip = 0, nlines = 0, na.strings = "NA",

 	flush = FALSE, fill = FALSE, strip.white = FALSE,

 	quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE,

 	comment.char = "", allowEscapes = FALSE,

 	encoding = "unknown")

The scan function allows you to read the contents of a file into R. Unlike readLines, scan allows you to read data into a specifically defined data structure using the argument what.

Here is a description of the arguments to scan.

ArgumentDescriptionDefault
fileA character string (specifying a file or URL) or a connection containing the data to read.“”
whatThe type of data to be read. If all fields are the same type, you can specify logical, integer, numeric, complex, character, or raw. Otherwise, specify a list of types to read values into a list. (You can specify the type of each element in the list individually.)double(0)
nmaxAn integer value specifying the number of values to read or the number of records to read (if what is a list). (Negative values mean “read until the end of the file.”)-1
nAn integer value specifying the number of values to read. (Negative values mean “read until the end of the file.”)-1
sepCharacter value specifying the separator between values. sep="" means that any whitespace character is interpreted as a separator.“”
quoteCharacter value used to quote strings.if(identical(sep, "\n")) "" else "'\""
decCharacter value used for decimal place in numbers.“.”
skipNumber of lines to skip at the top of the file.0
nlinesNumber of lines of data to read. Nonpositive values mean that there is no limit.0
na.stringsCharacter values specifying how NA values are encoded.“NA”
flushA logical value specifying whether to “flush” any remaining text on a line after the last requested item on a line is read into what. (Commonly used to allow comments at the end of lines or to ignore unneeded fields.)FALSE
fillSpecifies whether to add empty fields to lines with fewer fields than specified by what.FALSE
strip.whiteSpecifies whether to strip leading and trailing whitespace from character fields. Only applies when sep is specified.FALSE
quietIf quiet=FALSE, scan will print a message showing how many lines were read. If quiet=TRUE, this message is suppressed.FALSE
blank.lines.skipSpecifies whether to ignore blank lines.TRUE
multi.lineIf what is a list, allows records to span multiple lines.TRUE
comment.charNotes a character to be used to specify comment lines.“”
allowEscapesSpecifies whether C-style escapes (such as \t for tab character or \n for newlines) should be interpreted by scan or read verbatim. If allowEscapes=FALSE, they are interpreted as special characters; if allowEscapes=TRUE, they are read literally.FALSE
encodingA character value specifying the encoding of the input file.“unknown”

Like readLines, you can also use scan to enter data directly into R.

 

Other Software

Although many software packages can export data as text files, you might find it more convenient to read their data files directly. R can read files in many other formats. Table 12-1 shows a list of functions for reading (and writing) files in other formats. You can find more information about these functions in the help files.

Table 12-1. Functions to read and write data
File formatReadingWriting
ARFFread.arffwrite.arff
DBFread.dbfwrite.dbf
Stataread.dtawrite.dta
Epi Inforead.epiinfo 
Minitabread.mtp 
Octaveread.octave 
S3 binary files, data.dump filesread.S 
SPSSread.spss 
SAS Permanent Datasetread.ssd 
Systatread.sysstat 
SAS XPORT Fileread.xport 

R in a Nutshell

Learn more about this topic from R in a Nutshell.

R is rapidly becoming the standard for developing statistical software, and R in a Nutshell provides a quick and practical way to learn this increasingly popular open source language and environment. You'll not only learn how to program in R, but also how to find the right user-contributed R packages for statistical modeling, visualization, and bioinformatics.

See what you'll learn


Tags:
1 Subscribe


0 Replies