--- title: "Session reconstruction in R" author: "Oliver Keyes" date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{Session reconstruction in R} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ## Session reconstruction in R With trace data - from web logs, behavioural logs, really anything to do with user actions - reconstructing sessions (or 'sessionising') is essential. It lets an analyst divide up user actions into actual periods of sustained interaction, and from there compute a whole host of useful metrics, from session length to bounce rate. `reconstructr` is a library for just that - sessionisation and metric computation - in a way that keeps all the metadata about the events you're sessionising. ### Dividing events into sessions The nature of a "session" has provided fodder for researchers for years. Most people take an approach based on inactivity thresholds; if someone does not take an action in N seconds, their session has ended and a new one begins on their next logged action. Using reconstructr, we can conveniently divide events into sessions with the `sessionise` function. This takes a data.frame of events (along with specifiers of which column contains the user ID, and which column contains the timestamp) and a threshold value (in seconds). When the time between events by a user crosses that threshold, the session ends and a new one begins. We can demonstrate this using reconstructr's inbuilt session dataset: ```{r, eval=FALSE} library(reconstructr) data("session_dataset") str(session_dataset) 'data.frame': 63524 obs. of 3 variables: $ uuid : chr "47dc43895814861e21a2edf93348c826" "a736822df1890011694e7049cb3abef3" "674d2d00e096a3319874a4347caa1f4a" "f62d315398e6d04a3f2fa02e8ae42d49" ... $ timestamp: POSIXlt, format: "2014-01-07 00:00:15" "2014-01-07 00:01:11" "2014-01-07 00:01:54" ... $ url : chr "https://www.nasa.gov/history/mercury/mercury.html" "https://www.nasa.gov/images/ksclogosmall.gif" "https://www.nasa.gov/elv/hot.gif" "https://www.nasa.gov/facts/faq04.html" ... sessionised_data <- sessionise(x = session_dataset, timestamp = timestamp, user_id = uuid, threshold = 1800) str(sessionised_data) 'data.frame': 63524 obs. of 5 variables: $ uuid : chr "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" ... $ timestamp : POSIXlt, format: "2014-01-14 12:47:59" "2014-01-07 14:25:11" "2014-01-09 12:47:17" ... $ url : chr "https://www.nasa.gov/history/apollo/images/footprint-logo.gif" "https://www.nasa.gov/ksc.html" "https://www.nasa.gov/biomed/threat/gif/beachmousefinsmall.gif" "https://www.nasa.gov/shuttle/resources/orbiters/atlantis.html" ... $ session_id: chr "09cd65049020ed55472a2d8b1f47787e" "9dcb2f610297b3fe2c810907fa90fb8e" "70bcde51eff332d4ac820a90930f0f6e" "70bcde51eff332d4ac820a90930f0f6e" ... $ time_delta: int NA NA NA 45 4 75 274 47 NA 28 ... ``` Sessionisation adds two new columns; 'session\_id', a unique per-session ID, and 'time\_delta' - the time between an event and the previous event in the session. If the event was the first (or only) one in a session, that value will be `NA`. Crucially, existing metadata (like URLs, or activity type) is carried along with the session information and not dropped. From the sessionised data we can compute a whole host of useful metrics, many of which have convenience functions in this package. ### Session metrics An important metric in session data is the bounce rate: the proportion of sessions that included only a single event. This represents (absent data quality issues) the number of sessions where a user took only one action and then simply left. It can be computed with `bounce_rate`, which takes a sessionised dataset and produces the percentage of sessions resulting in bounces. Optionally (if you provide an argument for the `user_id` parameter) it produces the bounce rate on a per-user basis, rather than for the dataset overall: ```{r, eval=FALSE} str(bounce_rate(sessionised_data)) num 20.7 str(bounce_rate(sessionised_data, user_id = uuid)) 'data.frame': 10000 obs. of 2 variables: $ user_id : chr "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b2bc1a5438d8d54d4fbec139a2fd5" "001b6e80a14ba8d809c4ff18cdbade40" ... $ bounce_rate: num 100 14.3 0 100 100 ... ``` `time_on_page` is very similar, calculating either the mean (or median) time between events - either for the dataset as a whole or, if the `by_session` parameter is TRUE, for each session: ```{r, eval=FALSE} str(time_on_page(sessionised_data)) num 146 str(time_on_page(sessionised_data, by_session = TRUE)) 'data.frame': 22226 obs. of 2 variables: $ session_id : chr "00011b1e098848edee7e50a2174fe6ef" "0001f6457a4d09a8c2092278fec89a89" "000451f0869b7eab3582c093ace0253d" "0004c56ace95f92ee12bf9552401f923" ... $ time_on_page: num NaN NaN NaN NaN NaN ... ``` (It's not broken, it just so happened the first few sessions contained no non-NA time deltas). Finally, `session_count` and `session_length` provide easy ways of identifying how many sessions are in a sessionised dataset (overall, or on a per-user basis) and how long those sessions are, respectively. ### Other session functionality If you have ideas for other functionality that would make processing sessionseasier, the best approach is to either [request it](https://github.com/Ironholds/reconstructr/issues) or [add it](https://github.com/Ironholds/reconstructr/pulls)!