R, as a language, is used for analysing pretty much everything from genomic data to financial information. It’s also used to analyse website access logs, but R lacks a good framework for doing that; the URL decoder isn’t vectorised, the file readers don’t have convenient defaults, and good luck normalising IP addresses at scale.
Enter webreadr, which contains convenient wrappers and functions for reading, munging and formatting data from access logs and other sources of web request data.
Base R has read.delim, which is convenient but much slower for file reading than Hadley’s new readr package. webreadr defines a set of wrapper functions around readr’s read_delim, designed for common access log formats. The most common historical log format is the Combined Log Format, used as one of the default formats by nginx and the Varnish caching system. webreadr lets you read it in trivially with read_combined:
library(webreadr)
#read in an example file that comes with the webreadr package
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))
#And if we look at the format...
str(data)
## spc_tbl_ [12 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ip_address : chr [1:12] "127.0.0.1" "123.123.123.123" "123.123.123.123" "123.123.123.123" ...
## $ remote_user_ident: chr [1:12] NA NA NA NA ...
## $ local_user_ident : chr [1:12] "frank" NA NA NA ...
## $ timestamp : POSIXct[1:12], format: "2000-10-10 20:55:36" "2000-04-26 04:23:48" ...
## $ request : chr [1:12] "GET /apache_pb.gif HTTP/1.0" "GET /pics/wpaper.gif HTTP/1.0" "GET /asctortf/ HTTP/1.0" "GET /pics/5star2000.gif HTTP/1.0" ...
## $ status_code : int [1:12] 200 200 200 200 200 200 200 200 200 200 ...
## $ bytes_sent : int [1:12] 2326 6248 8130 4005 1031 4282 36 10801 11179 887 ...
## $ referer : chr [1:12] "http://www.example.com/start.html" "http://www.jafsoft.com/asctortf/" "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "http://www.jafsoft.com/asctortf/" ...
## $ user_agent : chr [1:12] "Mozilla/4.08 [en] (Win98; I ;Nav)" "Mozilla/4.05 (Macintosh; I; PPC)" "Mozilla/4.05 (Macintosh; I; PPC)" "Mozilla/4.05 (Macintosh; I; PPC)" ...
## - attr(*, "spec")=
## .. cols(
## .. ip_address = col_character(),
## .. remote_user_ident = col_character(),
## .. local_user_ident = col_character(),
## .. timestamp = col_datetime(format = "%d/%b/%Y:%H:%M:%S %z"),
## .. request = col_character(),
## .. status_code = col_integer(),
## .. bytes_sent = col_integer(),
## .. referer = col_character(),
## .. user_agent = col_character()
## .. )
As you can see, the types have been appropriately set, the date/times have been parsed, and sensible header names have been set. The same thing can be done with the Common Log Format, used by default Apache configurations and as one of the defaults for Squid caching servers, using read_clf. The other Squid default format can be read with read_squid.
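As a quick sketch of both calls, with hypothetical file paths:

#read a Common Log Format file and one of Squid's default formats
#("access.log" and "squid.log" are placeholder paths)
clf_data <- read_clf("access.log")
squid_data <- read_squid("squid.log")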
Amazon’s AWS files are also supported with read_aws, which includes automatic field detection, and S3 bucket access logs can be read with read_s3.
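In the same sketchy spirit, again with placeholder paths:

#read an AWS log and an S3 bucket access log ("aws.log" and "s3.log" are hypothetical)
aws_data <- read_aws("aws.log")
s3_data <- read_s3("s3.log")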
One of the things you’ll notice about the example above is the “request” field: it contains not only the asset requested, but also the HTTP method and the protocol used. That’s pretty inconvenient for anyone looking to do something productive with the data.
Normally you’d split each field out into a list, then curse and recombine them into a data.frame, hoping that doing so didn’t hit R’s memory limit during the “unlist” stage, and it’d take an absolute age. Or you could just split them up directly into a data frame using split_clf:
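#split the request field read in above into its component parts
requests <- split_clf(data$request)
str(requests)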
## 'data.frame': 12 obs. of 3 variables:
## $ method : chr "GET" "GET" "GET" "GET" ...
## $ asset : chr "/apache_pb.gif" "/pics/wpaper.gif" "/asctortf/" "/pics/5star2000.gif" ...
## $ protocol: chr "HTTP/1.0" "HTTP/1.0" "HTTP/1.0" "HTTP/1.0" ...
This is faster than manual splitting-and-data.frame-ing, easier on the end user, and less likely to end in unexpected segfaults with large datasets. It also works on the uri field within S3 access logs.
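For instance, reusing the hypothetical s3_data from above (and assuming the column is named uri, as described):

#split_clf also handles the uri field from S3 access logs
s3_requests <- split_clf(s3_data$uri)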
A similar function, split_squid, exists for the status_code field in files read in with read_squid, which suffers from a similar problem.
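A sketch of that, reusing the hypothetical squid_data from earlier:

#split the combined squid status field into its component parts
statuses <- split_squid(squid_data$status_code)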
If you have ideas for other URL handlers that would make access log processing easier, the best approach is to either request them or add them!