URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist to create connections to retrieve datasets. This is an essential feature for the language to have, but it also means that URL handlers are designed for situations where URLs get you to the data - not situations where URLs are the data.
There is no support for encoding or decoding URLs en masse, and no support for parsing and interpreting them. urltools provides this support!
Base R provides two functions - URLdecode and URLencode - for taking percent-encoded URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to enable connections, and so they have several inherent limitations, including a lack of vectorisation, that make them unsuitable for large datasets.
Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations: URLdecode, for example, breaks if a percent sign is followed by characters that are not valid hexadecimal digits, so that the decoded value is out of range:
URLdecode("test%gIL")
Error in rawToChar(out) : embedded nul in string: '\0L'
In addition: Warning message:
In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
URLencode, on the other hand, encodes slashes on its strictest setting (reserved = TRUE) without paying attention to where those slashes are. If we attempt to URLencode an entire URL, we get:
URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
That’s a completely unusable URL (or ewRL, if you will).
urltools replaces both functions with url_decode and url_encode respectively:
library(urltools)
url_decode("test%gIL")
[1] "test"
url_encode("https://en.wikipedia.org/wiki/Article")
[1] "https://en.wikipedia.org%2fwiki%2fArticle"
As you can see, url_decode simply excludes out-of-range characters from consideration, while url_encode detects characters that make up part of the URL’s scheme, and leaves them unencoded. Both are extremely fast; with urltools, you can decode a vector of 1,000,000 URLs in 0.9 seconds.
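To make the vectorisation concrete, here is a minimal sketch that decodes several strings in a single call. The input strings are made up for illustration and are not from any real dataset.

```r
library(urltools)

# Hypothetical percent-encoded inputs, invented for this example
encoded <- c("https%3A%2F%2Fexample.org%2Fa%20b",
             "test%2Fpath",
             "plain")

# One call handles the whole vector; no loop or sapply is needed
decoded <- url_decode(encoded)
decoded
```

The result is a character vector of the same length as the input, so it can be assigned straight back into a data.frame column.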
Alongside these, we have functions for encoding and decoding the ‘punycode’ format of URLs - ones that are designed to be internationalised and have Unicode characters in them. These also take one argument, a vector of URLs, and are available as puny_encode and puny_decode respectively.
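As a rough illustration of the punycode functions, the sketch below encodes a URL whose domain contains a non-ASCII character and then decodes the result. The URL is an example value only, and the character is written as a Unicode escape so that the snippet does not depend on the encoding of the source file.

```r
library(urltools)

# Example URL with a u-umlaut in the domain, written as an escape
international <- "https://www.b\u00fccher.com/foo"

# Convert the domain to its ASCII-compatible punycode form
encoded <- puny_encode(international)
encoded

# Convert it back to the Unicode form
puny_decode(encoded)
```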
Once you’ve got your nicely decoded (or encoded) URLs, it’s time to do something with them - and, most of the time, you won’t actually care about most of the URL. You’ll want to look at the scheme, or the domain, or the path, but not the entire thing as one string.
The solution is url_parse, which takes a URL and breaks it out into its RFC 3986 components: scheme, domain, port, path, query string and fragment identifier. This is, again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The results are provided as a data.frame, since most people use data.frames to store data.
> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
> str(parsed_address)
'data.frame': 1 obs. of 6 variables:
$ scheme : chr "https"
$ domain : chr "en.wikipedia.org"
$ port : chr NA
$ path : chr "wiki/Article"
$ parameter: chr NA
$ fragment : chr NA
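Because the result is an ordinary data.frame with one row per URL, the usual subsetting tools apply. The following sketch parses a small vector and picks out the rows that carry an explicit port; the example.com and example.org addresses are placeholders.

```r
library(urltools)

# Placeholder URLs covering a few different shapes
urls <- c("https://en.wikipedia.org/wiki/Article",
          "http://example.com:8080/index.html?lang=en#top",
          "ftp://files.example.org/pub/data.csv")

parsed <- url_parse(urls)

# One row per input URL
nrow(parsed)

# Count URLs by scheme
table(parsed$scheme)

# Keep only the rows where a port was given
parsed[!is.na(parsed$port), c("domain", "port")]
```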
We can also perform the opposite of this operation with url_compose, which takes a data.frame of parsed components and reassembles the URLs.
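A minimal sketch of the round trip, reusing the URL from the parsing example, might look like this. The exact string returned (for instance, its letter case) may vary between urltools versions, so no output is shown.

```r
library(urltools)

# Break the URL into its components...
parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")

# ...and reassemble them into a single string
recomposed <- url_compose(parsed_address)
recomposed
```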
With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting and setting. Syntax is identical to that of lubridate, but uses URL components as function names.
url <- "https://en.wikipedia.org/wiki/Article"
scheme(url)
"https"
scheme(url) <- "ftp"
url
"ftp://en.wikipedia.org/wiki/Article"
Fields that can be extracted or set are scheme, domain, port, path, parameters and fragment.
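The same pattern applies to the other components. As a sketch, the snippet below reads the domain and then replaces the domain and the fragment; the replacement values are arbitrary examples.

```r
library(urltools)

address <- "https://en.wikipedia.org/wiki/Article"

# Getter: read a single component
original_domain <- domain(address)
original_domain

# Setters: replace components in place (values chosen for illustration)
domain(address) <- "fr.wikipedia.org"
fragment(address) <- "History"
address
```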
Once we’ve extracted a domain from a URL with domain or url_parse, we can identify which bit is the domain name, and which bit is the suffix:
> url <- "https://en.wikipedia.org/wiki/Article"
> domain_name <- domain(url)
> domain_name
[1] "en.wikipedia.org"
> str(suffix_extract(domain_name))
'data.frame': 1 obs. of 4 variables:
$ host : chr "en.wikipedia.org"
$ subdomain: chr "en"
$ domain : chr "wikipedia"
$ suffix : chr "org"
This relies on an internal database of public suffixes, accessible at suffix_dataset. We recognise, though, that this dataset may get a bit out of date, so you can also pass the results of the suffix_refresh function, which retrieves an updated dataset, to suffix_extract:
domain_name <- domain("https://en.wikipedia.org/wiki/Article")
updated_suffixes <- suffix_refresh()
suffix_extract(domain_name, updated_suffixes)
We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are tld_refresh, tld_extract and tld_dataset.
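For comparison with the suffix example, a short sketch of the top-level-domain version follows. It uses the bundled dataset, so no network access is needed.

```r
library(urltools)

domain_name <- domain("https://en.wikipedia.org/wiki/Article")

# Uses the built-in tld_dataset; the result of tld_refresh() can be
# passed as a second argument to work from an updated list instead
tlds <- tld_extract(domain_name)
str(tlds)
```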
In the other direction we have host_extract, which retrieves, well, the host! If the URL has subdomains, it’ll be the lowest-level subdomain. If it doesn’t, it’ll be the actual domain name, without the suffixes.
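A minimal sketch of host_extract, applied to the same domain used in the suffix example, could look like the following. The output is not reproduced here; inspect it with str to see the columns returned by your installed version.

```r
library(urltools)

domain_name <- domain("https://en.wikipedia.org/wiki/Article")

# Returns a data.frame with one row per input domain
hosts <- host_extract(domain_name)
str(hosts)
```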
Once a URL is parsed, it’s sometimes useful to get the value associated with a particular query parameter. As an example, take the URL http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json. What pageID is being used? What is the export format? We can find out with param_get.
> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
parameter_names = c("pageid","export")))
'data.frame': 1 obs. of 2 variables:
$ pageid: chr "1023"
$ export: chr "json"
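Like the rest of the package, param_get accepts a vector of URLs and returns one row per URL. The sketch below adds a second, made-up URL that lacks the export parameter, to show how the result lines up row by row.

```r
library(urltools)

# The second URL is a hypothetical variant with no "export" parameter
urls <- c("http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
          "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=77")

params <- param_get(urls = urls, parameter_names = c("pageid", "export"))

# One row per URL, one column per requested parameter
params
```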
This isn’t the only function for query manipulation; we can also dynamically modify the values a particular parameter might have, or strip them out entirely.
To modify the values, we use param_set:
url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
url <- param_set(url, key = "pageid", value = "12")
url
# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
As you can see, this works pretty well; it even works in situations where the URL doesn’t have a query yet:
url <- "http://en.wikipedia.org/wiki/api.php"
url <- param_set(url, key = "pageid", value = "12")
url
# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
On the other hand we might have a parameter we just don’t want any more - that can be handled with param_remove, which can take multiple parameters as well as multiple URLs.
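As a sketch of param_remove, the snippet below strips two parameters from the API URL used earlier. The choice of which keys to drop is arbitrary.

```r
library(urltools)

address <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"

# Drop the "action" and "export" parameters, keeping everything else
address <- param_remove(address, keys = c("action", "export"))
address
```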
If you have ideas for other URL handlers that would make your data processing easier, the best approach is to either request them or add them!