Faster way to read fixed-width files

Multi tool use
Faster way to read fixed-width files
I work with a lot of fixed width files (i.e., no separating character) that I need to read into R. So, there is usually a definition of the column width to parse the string into variables. I can use read.fwf
to read in the data without a problem. However, for large files, this can take a long time. For a recent dataset, this took 800 seconds to read in a dataset with ~500,000 rows and 143 variables.
read.fwf
seer9 <- read.fwf("~/data/rawdata.txt",
widths = cols,
header = FALSE,
buffersize = 250000,
colClasses = "character",
stringsAsFactors = FALSE))
fread
in the data.table
package in R is awesome for solving most data read problems, except it doesn't parse fixed width files. However, I can read each line in as a single character string (~500,000 rows, 1 column). This takes 3-5 seconds. (I love data.table.)
fread
data.table
seer9 <- fread("~/data/rawdata.txt", colClasses = "character",
sep = "n", header = FALSE, verbose = TRUE)
There are a number of good posts on SO on how to parse text files. See JHoward's suggestion here, to create a matrix of start and end columns, and substr
to parse the data. See GSee's suggestion here to use strsplit
. I couldn't figure out how to make that work with this data. (Also, Michael Smith made some suggestions on the data.table mailing list involving sed
that were beyond my ability to implement.) Now, using fread
and substr()
I can do the whole thing in about 25-30 seconds. Note that coercing to a data.table at end takes a chunk of time (5 sec?).
substr
strsplit
sed
fread
substr()
end_col <- cumsum(cols)
start_col <- end_col - cols + 1
start_end <- cbind(start_col, end_col) # matrix of start and end positions
text <- lapply(seer9, function(x) {
apply(start_end, 1, function(y) substr(x, y[1], y[2]))
})
dt <- data.table(text$V1)
setnames(dt, old = 1:ncol(dt), new = seervars)
What I am wondering is whether this can be improved any further? I know I am not the only one who has to read fixed width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using parallel
with mclapply
and data.table
instead of lapply
, but those didn't change anything. (Likely due to my inexperience in R.) I imagine that an Rcpp function could be written to do this really fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.
parallel
mclapply
data.table
lapply
My data.table implementation (with magrittr
chaining) takes the same time:
magrittr
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>%
data.table(.)
Can anyone make suggestions to improve the speed of this? Or is this about as good as it gets?
Here is code to create a similar data.table within R (rather than linking to actual data). It should have 331 characters, and 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space delimited data. (I am reading raw SEER data, in case anyone is interested.) Also including column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for SEER data.
seer9 <-
data.table(rep((paste0(paste0(letters, 1000:1054, " ", collapse = ""), " ")),
500000))
cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
seervars <- c("CASENUM", "REG", "MAR_STAT", "RACE", "ORIGIN", "NHIA", "SEX", "AGE_DX", "YR_BRTH", "PLC_BRTH", "SEQ_NUM", "DATE_mo", "DATE_yr", "SITEO2V", "LATERAL", "HISTO2V", "BEHO2V", "HISTO3V", "BEHO3V", "GRADE", "DX_CONF", "REPT_SRC", "EOD10_SZ", "EOD10_EX", "EOD10_PE", "EOD10_ND", "EOD10_PN", "EOD10_NE", "EOD13", "EOD2", "EOD4", "EODCODE", "TUMOR_1V", "TUMOR_2V", "TUMOR_3V", "CS_SIZE", "CS_EXT", "CS_NODE", "CS_METS", "CS_SSF1", "CS_SSF2", "CS_SSF3", "CS_SSF4", "CS_SSF5", "CS_SSF6", "CS_SSF25", "D_AJCC_T", "D_AJCC_N", "D_AJCC_M", "D_AJCC_S", "D_SSG77", "D_SSG00", "D_AJCC_F", "D_SSG77F", "D_SSG00F", "CSV_ORG", "CSV_DER", "CSV_CUR", "SURGPRIM", "SCOPE", "SURGOTH", "SURGNODE", "RECONST", "NO_SURG", "RADIATN", "RAD_BRN", "RAD_SURG", "SS_SURG", "SRPRIM02", "SCOPE02", "SRGOTH02", "REC_NO", "O_SITAGE", "O_SEQCON", "O_SEQLAT", "O_SURCON", "O_SITTYP", "H_BENIGN", "O_RPTSRC", "O_DFSITE", "O_LEUKDX", "O_SITBEH", "O_EODDT", "O_SITEOD", "O_SITMOR", "TYPEFUP", "AGE_REC", "SITERWHO", "ICDOTO9V", "ICDOT10V", "ICCC3WHO", "ICCC3XWHO", "BEHANAL", "HISTREC", "BRAINREC", "CS0204SCHEMA", "RAC_RECA", "RAC_RECY", "NHIAREC", "HST_STGA", "AJCC_STG", "AJ_3SEER", "SSG77", "SSG2000", "NUMPRIMS", "FIRSTPRM", "STCOUNTY", "ICD_5DIG", "CODKM", "STAT_REC", "IHS", "HIST_SSG_2000", "AYA_RECODE", "LYMPHOMA_RECODE", "DTH_CLASS", "O_DTH_CLASS", "EXTEVAL", "NODEEVAL", "METSEVAL", "INTPRIM", "ERSTATUS", "PRSTATUS", "CSSCHEMA", "CS_SSF8", "CS_SSF10", "CS_SSF11", "CS_SSF13", "CS_SSF15", "CS_SSF16", "VASINV", "SRV_TIME_MON", "SRV_TIME_MON_FLAG", "SRV_TIME_MON_PA", "SRV_TIME_MON_FLAG_PA", "INSREC_PUB", "DAJCC7T", "DAJCC7N", "DAJCC7M", "DAJCC7STG", "ADJTM_6VALUE", "ADJNM_6VALUE", "ADJM_6VALUE", "ADJAJCCSTG")
UPDATE:
LaF did the entire read in just under 7 seconds from the raw .txt file. Maybe there is an even faster way, but I doubt anything could do appreciably better. Amazing package.
27 July 2015 Update
Just wanted to provide a small update to this. I used the new readr package, and I was able to read in the entire file in 5 seconds using readr::read_fwf.
seer9_readr <- read_fwf("path_to_data/COLRECT.TXT",
col_positions = fwf_widths(cols))
Also, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above that uses fread to read the file (about 4 seconds), followed by apply to parse each line, the extraction of 143 variables took about 8 seconds with stringi::stri_sub compared to 19 for base::substr. So, fread plus stri_sub is still only about 12 seconds to run. Not bad.
seer9 <- fread("path_to_data/COLRECT.TXT",
colClasses = "character",
sep = "n",
header = FALSE)
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>%
data.table(.)
10 Dec 2015 update:
Please also see the answer below by @MichaelChirico who has added some great benchmarks and the iotools package.
@JanvanderLaan, He is able to read all the data into ram in 5 seconds with
fread()
. Parsing the 500k strings in parallel is the question I think.– bdemarest
Jul 12 '14 at 20:12
fread()
@bdemarest Yes, you are right. For the code using
fread
and substr
, the parsing of the substrings is indeed the bottleneck and this can be done in parallel.– Jan van der Laan
Jul 12 '14 at 20:21
fread
substr
4 Answers
4
Now that there are (between this and the other major question about effective reading of fixed-width files) a fair amount of options on the offer for reading in such files, I think some benchmarking is appropriate.
I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:
set.seed(21394)
wwidth = 400L
rrows = 1000000
#creating the contents at random
contents =
write.table(replicate(rrows, paste0(sample(letters, wwidth, replace = TRUE),
collapse = "")), file="testfwf.txt",
quote = FALSE, row.names = FALSE, col.names = FALSE)
#defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L))
cols = ist(beg = endpoints[-(n_fields + 1L)],
end = endpoints[-1L] - 1L)
dict = data.frame(column = paste0("V", seq_len(length(endpoints)) - 1L)),
start = endpoints[-length(endpoints)] - 1,
length = diff(endpoints))
write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)
I'll compare five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf
), piping the result of in2csv
to fread
(@AnandaMahto's suggestion), Hadley's new readr
(read_fwf
), that using LaF
/ffbase
(@jwijffls' suggestion), and an improved (streamlined) version of that suggested by the question author (@MarkDanese) combining fread
with stri_sub
from stringi
.
read.fwf
in2csv
fread
readr
read_fwf
LaF
ffbase
fread
stri_sub
stringi
Here is the benchmarking code:
library(data.table)
library(stringi)
library(readr)
library(LaF); library(ffbase)
library(microbenchmark)
microbenchmark(times = 5L,
utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
in2csv =
fread(paste("in2csv -f fixed -s",
"~/Desktop/testdic.csv",
"~/Desktop/testfwf.txt")),
readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
LaF = {
my.data.laf =
laf_open_fwf('testfwf.txt', column_widths=diff(endpoints),
column_types = rep("character",
length(endpoints) - 1L))
my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
as.data.frame(my.data)},
fread = fread(
"testfwf.txt", header = FALSE, sep = "n"
)[ , lapply(seq_len(length(cols$beg)),
function(ii)
stri_sub(V1, cols$beg[ii], cols$end[ii]))])
And the output:
# Unit: seconds
# expr min lq mean median uq max neval cld
# utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598 5 c
# in2csv 67.74065 68.56549 69.60069 70.11774 70.18746 71.39210 5 a
# readr 10.57945 11.32205 15.70224 14.89057 19.54617 22.17298 5 a
# LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798 5 b
# fread 14.42617 15.44693 26.09877 15.76016 20.45481 64.40581 5 a
So it seems readr
and fread
+ stri_sub
are pretty competitive as the fastest; built-in read.fwf
is the clear loser.
readr
fread
stri_sub
read.fwf
Note that the real advantage of readr
here is that you can pre-specify column types; with fread
you'll have to type convert afterwards.
readr
fread
EDIT: Adding some alternatives
At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:
library(iotools)
microbenchmark(times = 5L,
readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
fread = fread(
"testfwf.txt", header = FALSE, sep = "n"
)[ , lapply(seq_len(length(cols$beg)),
function(ii)
stri_sub(V1, cols$beg[ii], cols$end[ii]))],
iotools = input.file("testfwf.txt", formatter = dstrfw,
col_types = rep("character",
length(endpoints) - 1L),
widths = diff(endpoints)),
awk = fread(paste(
"awk -v FIELDWIDTHS='",
paste(diff(endpoints), collapse = " "),
"' -v OFS=', ' '{$1=$1 ""; print}' < ~/Desktop/testfwf.txt",
collapse = " "), header = FALSE))
And the new output:
# Unit: seconds
# expr min lq mean median uq max neval cld
# readr 7.892527 8.016857 10.293371 9.527409 9.807145 16.222916 5 a
# fread 9.652377 9.696135 9.796438 9.712686 9.807830 10.113160 5 a
# iotools 5.900362 7.591847 7.438049 7.799729 7.845727 8.052579 5 a
# awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156 5 b
So it appears iotools
is both very fast and very consistent.
iotools
The benchmarks are useful. In the comments at the other question, I suggested trying the "iotools" package. Can you include that in the benchmarks, as well as the "awk" solution? I'm guessing the "awk" approach would be faster than "in2csv", but slower than "fread"/"readr", and, based on my experience with "iotools", I wouldn't be surprised if that's faster than the options available so far. Not tested, but the approach should be something like:
library(iotools); input.file("testfwf.txt", formatter = dstrfw, col_types = rep("character", length(col_ends)-1), widths = diff(col_ends))
. (+1)– A5C1D2H2I1M1N2O1R2T1
Dec 10 '15 at 5:52
library(iotools); input.file("testfwf.txt", formatter = dstrfw, col_types = rep("character", length(col_ends)-1), widths = diff(col_ends))
Oh, and for the error with "sqldf" (which I wouldn't bother testing for speed comparison), it's probably because we need to specify whatever the equivalent of
header = FALSE
would be. Don't really have the time to explore at this moment....– A5C1D2H2I1M1N2O1R2T1
Dec 10 '15 at 6:54
header = FALSE
Thanks to both of you. This is great information. I will edit the original question to guide readers to look down here.
– Mark Danese
Dec 10 '15 at 22:58
You can use the LaF
package, which was written to handle large fixed width files (also too large to fit into memory). To use it you first need to open the file using laf_open_fwf
. You can then index the resulting object as you would a normal data frame to read the data you need. In the example below, I read the entire file, but you can also read specific columns and/or lines:
LaF
laf_open_fwf
library(LaF)
laf <- laf_open_fwf("foo.dat", column_widths = cols,
column_types=rep("character", length(cols)),
column_names = seervars)
seer9 <- laf[,]
Your example using 5000 lines (instead of your 500,000) took 28 seconds using read.fwf
and 1.6 seconds using LaF
.
read.fwf
LaF
Addition Your example using 50,000 lines (instead of your 500,000) took 258 seconds using read.fwf
and 7 seconds using LaF
on my machine.
read.fwf
LaF
I did not know about this package. Wow. 6 seconds. Excellent. About as fast as fread for a CSV file, which is very impressive. Will look into this more, since we have some large datasets. Thanks.
– Mark Danese
Jul 12 '14 at 20:16
I wrote a parser for this kind of thing yesterday, but it was for a very specific kind of input to the header file, so I will show you how to format your column widths to be able to use it.
Converting your flat file to csv
First download the tool in question.
You can download the binary from the bin
directory if you are on OS X Mavericks (where I compiled it on) or compile it by going to src
and using clang++ csv_iterator.cpp parse.cpp main.cpp -o flatfileparser
.
bin
src
clang++ csv_iterator.cpp parse.cpp main.cpp -o flatfileparser
The flat file parser needs two files, a CSV header file in which every fifth element specifies the variable width (again, this is due to my extremely specific application), which you can generate using:
cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
writeLines(sapply(c(-1, cols), function(x) paste0(',,,,', x)), '~/tmp/header.csv')
and copying the resulting ~/tmp/header.csv
to the same directory as your flatfileparser
. Move the flat file to the same directory as well, and you can run it on your flat file:
~/tmp/header.csv
flatfileparser
./flatfileparser header.csv yourflatfile
which will produce yourflatfile.csv
. Add the header you have above in manually using piping (>>
from Bash).
yourflatfile.csv
>>
Reading in your CSV file quickly
Use Hadley's experimental fastread package by passing the filename to fastread::read_csv
, which yields a data.frame
. I don't believe he supports fwf
files yet although it is on the way.
fastread::read_csv
data.frame
fwf
I can't seem to get it to work. I am not a command line person, so it may just be me doing something wrong.
mark-mbp-osx:bin mark$ flatfileparser header.csv COLRECT.TXT
gives me -bash: flatfileparser: command not found
on Mavericks. This is the listing of the directory: mark-mbp-osx:bin mark$ ls COLRECT.TXT flatfileparser header.csv
– Mark Danese
Jul 12 '14 at 19:45
mark-mbp-osx:bin mark$ flatfileparser header.csv COLRECT.TXT
-bash: flatfileparser: command not found
mark-mbp-osx:bin mark$ ls COLRECT.TXT flatfileparser header.csv
Try
chmod +x flatfileparser; ./flatfileparser header.csv COLRECT.TXT
– Robert Krzyzanowski
Jul 12 '14 at 19:49
chmod +x flatfileparser; ./flatfileparser header.csv COLRECT.TXT
It seems to have worked even though it gave errors:
mark-mbp-osx:bin mark$ chmod +x flatfileparserchmod +x flatfileparser; ./flatfileparser header.csv COLRECT.TXT
chmod: flatfileparserchmod: No such file or directory
chmod: +x: No such file or directory
mark-mbp-osx:bin mark$
– Mark Danese
Jul 12 '14 at 19:55
mark-mbp-osx:bin mark$ chmod +x flatfileparserchmod +x flatfileparser; ./flatfileparser header.csv COLRECT.TXT
chmod: flatfileparserchmod: No such file or directory
chmod: +x: No such file or directory
mark-mbp-osx:bin mark$
I think you pasted the string "chmod +x flatfileparser" twice. Try two separate commands: first
chmod +x flatfileparser
and then ./flatfileparser header.csv COLRECT.TXT
– Robert Krzyzanowski
Jul 12 '14 at 20:02
chmod +x flatfileparser
./flatfileparser header.csv COLRECT.TXT
My fault, I pasted it into SO twice. I ended up with 144 columns instead of 143. It seems to work fine, so thanks. I am not sure I could use this regularly or on our Windows server. It would be great if it were easy to access from within R. I am just not a real programmer.
– Mark Danese
Jul 12 '14 at 20:09
I'm not sure what OS you are using, but this worked pretty straightforwardly for me in Linux:
Step 1: Create a command for awk
to convert the file to a csv
awk
You can have it stored to an actual csv file if you plan to use the data in other software too.
myCommand <- paste(
"awk -v FIELDWIDTHS='",
paste(cols, collapse = " "),
"' -v OFS=',' '{$1=$1 ""; print}' < ~/rawdata.txt",
collapse = " ")
Step 2: Use fread
directly on that command that you just created
fread
seer9 <- fread(myCommand)
I haven't timed this because I'm obviously using a slower system than you and Jan :-)
Thanks a lot. I was hoping someone might suggest something like this. I tried it and it returned an error.
Error in fread(myCommand) : ' ends field 14 on line 26 when detecting types: 428135680000001527 . . .
I couldn't paste the entire 331 char string. Not sure what the issue is. This is OSX (Mavericks). I should probably force all to char for now.– Mark Danese
Jul 14 '14 at 21:39
Error in fread(myCommand) : ' ends field 14 on line 26 when detecting types: 428135680000001527 . . .
I tried forcing all to character. But the issue is that freed is only detecting 15 columns, not 143. Here is an edited version of my Command dropping many col values to fit in this comment:
"awk -v FIELDWIDTHS=' 8 10 1 2 1 1 1 3 4 3 2 2 4 4 1 4 1 4 1 1 1 1 3 2 2 1 2 2 13 2 4 1 1 ' -v OFS=',' '{$1=$1 ""; print}' < ~/file.TXT"
– Mark Danese
Jul 14 '14 at 21:43
"awk -v FIELDWIDTHS=' 8 10 1 2 1 1 1 3 4 3 2 2 4 4 1 4 1 4 1 1 1 1 3 2 2 1 2 2 13 2 4 1 1 ' -v OFS=',' '{$1=$1 ""; print}' < ~/file.TXT"
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Parallel reading your file isn't going to help. The bottleneck is the file IO. (Except of course when the data is spread across multiple machines/hard drives.)
– Jan van der Laan
Jul 12 '14 at 20:00