---
title: "Introduction to webreadr"
author: "Oliver Keyes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to webreadr}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Reading web access logs

R, as a language, is used for analysing pretty much everything from genomic data to financial information. It's also used to analyse website access logs, but R lacks a good framework for doing that: the URL decoder isn't vectorised, the file readers don't have convenient defaults, and good luck normalising IP addresses at scale.

Enter webreadr, which contains convenient wrappers and functions for reading, munging and formatting data from access logs and other sources of web request data.

## File reading

Base R has read.delim, which is convenient but much slower for file reading than Hadley's new [readr](https://github.com/hadley/readr) package. webreadr defines a set of wrapper functions around readr's read\_delim, designed for common access log formats.

The most common historical log format is the [Combined Log Format](http://httpd.apache.org/docs/1.3/logs.html#combined); this is used as one of the default formats for [nginx](http://nginx.org/) and the [Varnish caching system](https://www.varnish-cache.org/docs/trunk/reference/varnishncsa.html). webreadr lets you read it in trivially with read\_combined:

```{r}
library(webreadr)
# Read in an example file that comes with the webreadr package
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))
# And if we look at the format...
str(data)
```

As you can see, the types have been appropriately set, the date/times have been parsed, and sensible header names have been set.

The same thing can be done with the Common Log Format, used by Apache default configurations and as one of the defaults for Squid caching servers, using read\_clf. The other squid default format can be read with read\_squid.

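
To make the Combined Log Format a little more concrete, here is a rough base-R sketch of the work a reader has to do for a single line: pull the fields apart, convert the numeric ones, and parse the timestamp. The sample line is the one used in the Apache documentation; the regular expression and the column names (`host`, `ident`, `user` and so on) are illustrative assumptions, not webreadr's internals or its actual header names.

```r
# One Combined Log Format line (the example from the Apache docs)
line <- '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'

# Illustrative pattern: three bare fields, a [bracketed] timestamp,
# a "quoted" request, two bare fields, then two more "quoted" fields
pattern <- '^([^ ]+) ([^ ]+) ([^ ]+) \\[([^]]+)\\] "([^"]*)" ([^ ]+) ([^ ]+) "([^"]*)" "([^"]*)"$'

# m[1] is the full match; m[2] to m[10] are the nine captured fields
m <- regmatches(line, regexec(pattern, line))[[1]]

entry <- data.frame(
  host      = m[2],
  ident     = m[3],
  user      = m[4],
  # %b relies on English month abbreviations in the current locale
  timestamp = as.POSIXct(strptime(m[5], "%d/%b/%Y:%H:%M:%S %z", tz = "UTC")),
  request   = m[6],
  status    = as.integer(m[7]),
  bytes     = as.integer(m[8]),  # real logs may hold "-" here, which becomes NA
  referer   = m[9],
  agent     = m[10],
  stringsAsFactors = FALSE
)

str(entry)
```

Doing this by hand for every line of a large file is exactly the kind of chore the read\_ functions are meant to take off your hands.
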
Amazon's AWS files are also supported, with read\_aws, which includes automatic field detection, and S3 bucket access logs can be read with read\_s3.

## Splitting combined fields

One of the things you'll notice about the example above is the "request" field: it contains not only the actual asset requested, but also the HTTP method and the protocol used. That's pretty inconvenient for people looking to do something productive with the data.

Normally you'd split each field out into a list, then curse, recombine the pieces into a data.frame, and hope that doing so didn't hit R's memory limit during the "unlist" stage. It would also take an absolute age. Or, you could just split them up directly into a data frame using split\_clf:

```{r}
requests <- split_clf(data$request)
str(requests)
```

This is faster than manual splitting-and-data.frame-ing, easier on the end user, and less likely to end in unexpected segfaults with large datasets. It also works on the uri within S3 access logs.

A similar function, split\_squid, exists for the `status_code` field in files read in with read\_squid, which suffer from a similar problem.

## Other ideas

If you have ideas for other URL handlers that would make access log processing easier, the best approach is to either [request them](https://github.com/Ironholds/webreadr/issues) or [add them](https://github.com/Ironholds/webreadr/pulls)!
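
Returning to the "request" field from the splitting section: the manual split-and-recombine approach that split\_clf is meant to replace can be sketched in base R roughly as follows. The sample request strings and the column names `method`, `asset` and `protocol` are made up for illustration, and this is not a description of how webreadr does the job internally.

```r
# Three made-up request strings in the "METHOD asset PROTOCOL" shape
raw_requests <- c(
  "GET /index.html HTTP/1.1",
  "POST /api/items HTTP/1.1",
  "GET /images/logo.png HTTP/1.0"
)

# Step 1: split each string on spaces, giving a list of character vectors
pieces <- strsplit(raw_requests, " ", fixed = TRUE)

# Step 2: bind the pieces row-wise and turn the result into a data frame
split_manual <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(split_manual) <- c("method", "asset", "protocol")

str(split_manual)
```

This is fine for a handful of rows, but it assumes every request has exactly three space-separated parts, so malformed entries would need handling first, and the intermediate list is what makes it slow and memory-hungry on large logs.
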