R, as a language, is used for analysing pretty much everything from genomic data to financial information. It’s also used to analyse website access logs, but R lacks a good framework for doing that; the URL decoder isn’t vectorised, the file readers don’t have convenient defaults, and good luck normalising IP addresses at scale.
Enter webreadr, which contains convenient wrappers and functions for reading, munging and formatting data from access logs and other sources of web request data.
Base R has read.delim, which is convenient but much slower for file reading than Hadley’s new readr package. webreadr defines a set of wrapper functions around readr’s read_delim, designed for common access log formats. The most common historical log format is the Combined Log Format, used as one of the default formats by nginx and the Varnish caching system. webreadr lets you read it in trivially with read_combined:
library(webreadr)
#read in an example file that comes with the webreadr package
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))
#And if we look at the format...
str(data)
## spc_tbl_ [12 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ip_address : chr [1:12] "127.0.0.1" "123.123.123.123" "123.123.123.123" "123.123.123.123" ...
## $ remote_user_ident: chr [1:12] NA NA NA NA ...
## $ local_user_ident : chr [1:12] "frank" NA NA NA ...
## $ timestamp : POSIXct[1:12], format: "2000-10-10 20:55:36" "2000-04-26 04:23:48" ...
## $ request : chr [1:12] "GET /apache_pb.gif HTTP/1.0" "GET /pics/wpaper.gif HTTP/1.0" "GET /asctortf/ HTTP/1.0" "GET /pics/5star2000.gif HTTP/1.0" ...
## $ status_code : int [1:12] 200 200 200 200 200 200 200 200 200 200 ...
## $ bytes_sent : int [1:12] 2326 6248 8130 4005 1031 4282 36 10801 11179 887 ...
## $ referer : chr [1:12] "http://www.example.com/start.html" "http://www.jafsoft.com/asctortf/" "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "http://www.jafsoft.com/asctortf/" ...
## $ user_agent : chr [1:12] "Mozilla/4.08 [en] (Win98; I ;Nav)" "Mozilla/4.05 (Macintosh; I; PPC)" "Mozilla/4.05 (Macintosh; I; PPC)" "Mozilla/4.05 (Macintosh; I; PPC)" ...
## - attr(*, "spec")=
## .. cols(
## .. ip_address = col_character(),
## .. remote_user_ident = col_character(),
## .. local_user_ident = col_character(),
## .. timestamp = col_datetime(format = "%d/%b/%Y:%H:%M:%S %z"),
## .. request = col_character(),
## .. status_code = col_integer(),
## .. bytes_sent = col_integer(),
## .. referer = col_character(),
## .. user_agent = col_character()
## .. )
As you can see, the types have been appropriately set, the date/times have been parsed, and sensible header names have been set. The same thing can be done with the Common Log Format, used by default Apache configurations and as one of the defaults for Squid caching servers, using read_clf. The other Squid default format can be read with read_squid.
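As a quick sketch of both calls, with hypothetical file paths:

#read a Common Log Format file and one of Squid's default formats
#("access.log" and "squid.log" are placeholder paths)
clf_data <- read_clf("access.log")
squid_data <- read_squid("squid.log")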
Amazon’s AWS files are also supported with read_aws, which includes automatic field detection, and S3 bucket access logs can be read with read_s3.
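In the same sketchy spirit, again with placeholder paths:

#read an AWS log and an S3 bucket access log ("aws.log" and "s3.log" are hypothetical)
aws_data <- read_aws("aws.log")
s3_data <- read_s3("s3.log")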
One of the things you’ll notice about the example above is the “request” field: it contains not only the asset requested, but also the HTTP method and the protocol used. That’s pretty inconvenient for anyone looking to do something productive with the data.
Normally you’d split each field out into a list, then curse and recombine them into a data.frame, hoping that doing so didn’t hit R’s memory limit during the “unlist” stage, and it’d take an absolute age. Or you could just split them up directly into a data frame using split_clf:
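#split the request field read in above into its component parts
requests <- split_clf(data$request)
str(requests)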
## 'data.frame': 12 obs. of 3 variables:
## $ method : chr "GET" "GET" "GET" "GET" ...
## $ asset : chr "/apache_pb.gif" "/pics/wpaper.gif" "/asctortf/" "/pics/5star2000.gif" ...
## $ protocol: chr "HTTP/1.0" "HTTP/1.0" "HTTP/1.0" "HTTP/1.0" ...
This is faster than manual splitting-and-data.frame-ing, easier on the end user, and less likely to end in unexpected segfaults with large datasets. It also works on the uri field within S3 access logs.
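For instance, reusing the hypothetical s3_data from above (and assuming the column is named uri, as described):

#split_clf also handles the uri field from S3 access logs
s3_requests <- split_clf(s3_data$uri)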
A similar function, split_squid, exists for the status_code field in files read in with read_squid, which suffers from a similar problem.
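A sketch of that, reusing the hypothetical squid_data from earlier:

#split the combined squid status field into its component parts
statuses <- split_squid(squid_data$status_code)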
If you have ideas for other URL handlers that would make access log processing easier, the best approach is to either request them or add them!