Parsing a File

Here we generalize our knowledge of parsing to parse a whole file. After loading your data, you can check the type of each column in several ways, for example by expanding the object listed in the Environment pane or by calling str() or summary() on it.
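
As a minimal sketch of that check, assuming a small made-up data frame (the column names and values are illustrative only), str() and summary() report each column's type, and sapply() with class() collects the types in a named vector:

```r
# A small, made-up data frame standing in for data you have loaded.
df <- data.frame(
  id = c(1L, 2L, 3L),
  score = c(1.21, 2.32, 4.56),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Compact overview: one line per column with its type and first values.
str(df)

# Per-column summaries; the statistics shown depend on the column type.
summary(df)

# Named character vector holding the class of each column.
col_classes <- sapply(df, class)
col_classes
```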

Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file. There are two new things that you'll learn about in this section:

How readr automatically guesses the type of each column.
How to override the default specification.

Strategy

readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to the values it finds. You can emulate this process with a character vector using guess_parser(), which returns readr's best guess, and parse_guess(), which uses that guess to parse the column:

guess_parser("2010-10-01")
#> [1] "date"
guess_parser("15:01")
#> [1] "time"
guess_parser(c("TRUE", "FALSE"))
#> [1] "logical"
guess_parser(c("1", "5", "9"))
#> [1] "double"
guess_parser(c("12,352,561"))
#> [1] "number"

str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"

The heuristic tries each of the following types, stopping when it finds a match:

logical: contains only "F", "T", "FALSE", or "TRUE".
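integer: contains only numeric characters (and -). Recent versions of readr skip this rule by default and guess double instead, which is why c("1", "5", "9") above is reported as "double".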
double: contains only valid doubles (including numbers like 4.5e-5).
number: contains valid doubles with the grouping mark inside.
time: matches the default time_format.
date: matches the default date_format.
date-time: any ISO8601 date.

If none of these rules apply, then the column will stay as a vector of strings.
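
The fallback to strings can be checked directly. The following is a small sketch with made-up input values; it assumes a readr version whose guess_parser() accepts the guess_integer argument:

```r
library(readr)

# Mixed content matches none of the typed rules, so the guess is character.
mixed <- guess_parser(c("apple", "1", "2010-10-01"))
mixed

# Whole numbers are guessed as double by default (as in the output above);
# guess_integer = TRUE asks for the integer rule to be tried as well.
whole <- guess_parser(c("1", "5", "9"), guess_integer = TRUE)
whole

# parse_guess() applies the guessed parser to the vector.
str(parse_guess(c("TRUE", "FALSE")))
```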

Problems

These defaults don't always work for larger files. There are two basic problems:

The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.
The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.

readr contains a challenging CSV that illustrates both of these problems:

challenge <- read_csv(readr_example("challenge.csv"))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
#> Warning: 1000 parsing failures.
#>  row col           expected     actual                                                           file
#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> .... ... .................. .......... ..............................................................
#> See problems(...) for more details.

(Note the use of readr_example(), which finds the path to one of the files included with the package.)

There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It's always a good idea to explicitly pull out the problems(), so you can explore them in more depth:

problems(challenge)
#> # A tibble: 1,000 x 5
#>     row col   expected        actual   file                                     
#>   <int> <chr> <chr>           <chr>    <chr>                                    
#> 1  1001 y     1/0/T/F/TRUE/F… 2015-01… '/Users/runner/work/_temp/Library/readr/…
#> 2  1002 y     1/0/T/F/TRUE/F… 2018-05… '/Users/runner/work/_temp/Library/readr/…
#> 3  1003 y     1/0/T/F/TRUE/F… 2015-09… '/Users/runner/work/_temp/Library/readr/…
#> 4  1004 y     1/0/T/F/TRUE/F… 2012-11… '/Users/runner/work/_temp/Library/readr/…
#> 5  1005 y     1/0/T/F/TRUE/F… 2020-01… '/Users/runner/work/_temp/Library/readr/…
#> 6  1006 y     1/0/T/F/TRUE/F… 2016-04… '/Users/runner/work/_temp/Library/readr/…
#> # … with 994 more rows
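
The same inspection can be tried on a tiny file. This is a sketch only: the data is made up, and y is deliberately declared as logical so that the failures occur regardless of how a given readr version guesses types:

```r
library(readr)

# Write a tiny CSV to a temporary file (made-up data).
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2015-01-16", "2,2018-05-18"), path)

# Deliberately declare y as logical so that both date values fail to parse.
df <- suppressWarnings(
  read_csv(path, col_types = cols(x = col_double(), y = col_logical()))
)

# One row per failure, including the expected type and the actual value.
probs <- problems(df)
probs

# Count failures per column to see where to focus first.
table(probs$col)
```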

A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the y column. If we look at the last few rows, we'll see that y is NA there, because the values could not be parsed as logicals; the actual column of problems() shows that they are dates:

tail(challenge)
#> # A tibble: 6 x 2
#>       x y    
#>   <dbl> <lgl>
#> 1 0.805 NA   
#> 2 0.164 NA   
#> 3 0.472 NA   
#> 4 0.718 NA   
#> 5 0.270 NA   
#> 6 0.608 NA

That suggests we need to use a date parser instead. To fix the call, start by copying and pasting the column specification into your original call:

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_logical()
  )
)

Then you can fix the type of the y column by specifying that y is a date column:

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
tail(challenge)
#> # A tibble: 6 x 2
#>       x y         
#>   <dbl> <date>    
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06

Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already; you use col_xyz() when you want to tell readr how to load the data.
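
A short sketch of that correspondence, using made-up dates and a temporary file: parse_date() handles a character vector that is already in R, while col_date() describes the same column to read_csv().

```r
library(readr)

# Data already in R as a character vector: use the parse_*() function.
parsed <- parse_date(c("2015-01-16", "2018-05-18"))
str(parsed)

# Data still in a file: use the matching col_*() function inside col_types.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2015-01-16", "2,2018-05-18"), path)
loaded <- read_csv(path, col_types = cols(x = col_double(), y = col_date()))
str(loaded$y)
```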

I highly recommend always supplying col_types, building up from the print-out provided by readr. This ensures that you have a consistent and reproducible data import script. If you rely on the default guesses and your data changes, readr will continue to read it in, possibly with different column types and without raising an error. If you want to be really strict, use stop_for_problems(): that will throw an error and stop your script if there are any parsing problems.
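
As a sketch of the strict approach, the made-up file below is read with a deliberately wrong type for y. The tryCatch() wrapper is only there so the example can show the error message instead of halting; in a real script you would usually call stop_for_problems() directly.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2015-01-16", "2,2018-05-18"), path)

# y is deliberately mis-specified so that parsing problems occur.
df <- suppressWarnings(
  read_csv(path, col_types = cols(x = col_double(), y = col_logical()))
)

# stop_for_problems() turns any parsing problem into an error.
result <- tryCatch(
  {
    stop_for_problems(df)
    "no problems"
  },
  error = function(e) paste("stopped:", conditionMessage(e))
)
result
```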

Other strategies

There are a few other general strategies to help you parse files:

In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:

challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
challenge2
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # … with 1,994 more rows

Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:

challenge2 <- read_csv(readr_example("challenge.csv"), 
  col_types = cols(.default = col_character())
)

This is particularly useful in conjunction with type_convert(), which applies the parsing heuristics to the character columns in a data frame.

df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 1     1.21 
#> 2 2     2.32 
#> 3 3     4.56

# Note the column types
type_convert(df)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   x = col_double(),
#>   y = col_double()
#> )
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1  1.21
#> 2     2  2.32
#> 3     3  4.56

If you're reading a very large file, you might want to set n_max to a smallish number like 10,000 or 100,000. That will accelerate your iterations while you eliminate common problems.
If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with read_lines(), or even a character vector of length 1 with read_file(). Then you can use the string parsing skills you'll learn later to parse more exotic formats.
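
The last two strategies can be sketched on a small made-up file: n_max limits how many data rows are read, read_lines() returns one string per line, and read_file() returns the whole file as a single string.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b", "3,c", "4,d"), path)

# Read only the first two data rows while iterating on the column spec.
head_rows <- read_csv(
  path,
  n_max = 2,
  col_types = cols(x = col_double(), y = col_character())
)
nrow(head_rows)

# Fall back to raw lines: one element per line, header included.
lines <- read_lines(path)
length(lines)

# Or read the whole file as a character vector of length 1.
whole <- read_file(path)
length(whole)
```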

Source: H. Wickham and G. Grolemund, https://r4ds.had.co.nz/data-import.html
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.

Last modified: Monday, January 9, 2023, 3:50 PM