---
title: "Introduction to webreadr"
author: "Oliver Keyes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to webreadr}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
# Reading web access logs
R, as a language, is used for analysing pretty much everything from genomic data to financial information. It's also
used to analyse website access logs, and R lacks a good framework for doing that; the URL decoder isn't vectorised,
the file readers don't have convenient defaults, and good luck normalising IP addresses at scale.
Enter webreadr
, which contains convenient wrappers and functions for reading, munging and formatting
data from access logs and other sources of web request data.
## File reading
Base R has read.delim, which is convenient but much slower for file reading than Hadley's new [readr](https://github.com/hadley/readr)
package. webreadr
defines a set of wrapper functions around readr's read_delim
, designed
for common access log formats.
The most common historical log format is the [Combined Log Format](http://httpd.apache.org/docs/1.3/logs.html#combined); this is used as one of the default formats for [nginx](http://nginx.org/) and the [Varnish caching system](https://www.varnish-cache.org/docs/trunk/reference/varnishncsa.html). webreadr
lets you read it in trivially with read\_combined
:
```{r}
library(webreadr)
#read in an example file that comes with the webreadr package
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))
#And if we look at the format...
str(data)
```
As you can see, the types have been appropriately set, the date/times have been parsed, and sensible header names have been set.
The same thing can be done with the Common Log Format, used by Apache default configurations and as one of the defaults for
Squid caching servers, using read\_clf
. The other squid default format can be read with read\_squid
.
Amazon's AWS files are also supported, with read\_aws
, which includes automatic field detection, and S3 bucket
access logs can be read with read\_s3
.
## Splitting combined fields
One of the things you'll notice about the example above is the "request" field - it contains not only the actual asset
requested, but also the HTTP method used and the protocol used. That's pretty inconvenient for people looking to do something
productive with the data.
Normally you'd split each field out into a list, and then curse and recombine them into a data.frame and hope that
doing so didn't hit R's memory limit during the "unlist" stage, and it'd take an absolute age. Or, you could just split them
up directly into a data frame using split\_clf
:
```{r}
requests <- split_clf(data$request)
str(requests)
```
This is faster than manual splitting-and-data.frame-ing, easier on the end user, and less likely to end in unexpected segfaults with
large datasets. It also works on the uri
within S3 access logs.
A similar function, split\_squid
, exists for the `status\_code` field in files read in with read_squid
, which suffer from a similar problem.
## Other ideas
If you have ideas for other URL handlers that would make access log processing easier, the best approach
is to either [request it](https://github.com/Ironholds/webreadr/issues) or [add it](https://github.com/Ironholds/webreadr/pulls)!