Title: | Vectorised Tools for URL Handling and Parsing |
---|---|
Description: | A toolkit for all URL-handling needs, including encoding and decoding, parsing, parameter extraction and modification. All functions are designed to be both fast and entirely vectorised. It is intended to be useful for people dealing with web-related datasets, such as server-side logs, although may be useful for other situations involving large sets of URLs. |
Authors: | Os Keyes [aut, cre], Jay Jacobs [aut, cre], Drew Schmidt [aut], Mark Greenaway [ctb], Bob Rudis [ctb], Alex Pinto [ctb], Maryam Khezrzadeh [ctb], Peter Meilstrup [ctb], Adam M. Costello [cph], Jeff Bezanson [cph], Xueyuan Jiang [ctb] |
Maintainer: | Os Keyes <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.7.3 |
Built: | 2024-11-06 04:26:41 UTC |
Source: | https://github.com/ironholds/urltools |
As in the lubridate package, individual components of a URL can be extracted or set using the relevant function call; see the examples.
domain(x)

domain(x) <- value
x | a URL, or vector of URLs
value | a replacement value (or vector of replacement values) for x's domain.
scheme, port, path, parameters and fragment for other accessors.
#Get a component
example_url <- "http://cran.r-project.org/submit.html"
domain(example_url)

#Set a component
domain(example_url) <- "en.wikipedia.org"
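As a supplementary sketch (not part of the original examples), the accessor is vectorised in both directions; extraction works on a whole character vector, and assignment accepts a vector of replacement values. The replacement domains below are invented for illustration.

library(urltools)
urls <- c("http://cran.r-project.org/submit.html",
          "https://en.wikipedia.org/wiki/Main_Page")
domain(urls)                                      # extract both domains at once
domain(urls) <- c("example.org", "example.com")   # one replacement value per URL
urls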
As in the lubridate package, individual components of a URL can be extracted or set using the relevant function call; see the examples.
fragment(x)

fragment(x) <- value
x | a URL, or vector of URLs
value | a replacement value (or vector of replacement values) for x's fragment. If NULL, the fragment will be removed entirely.
scheme, domain, port, path and parameters for other accessors.
#Get a component
example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
fragment(example_url)

#Set a component
fragment(example_url) <- "production"

#Remove a component
fragment(example_url) <- NULL
host_extract extracts the host from a vector of domain names. A host isn't the same as a domain - it could be the subdomain, if there are one or more subdomains. The host of en.wikipedia.org is en, while the host of wikipedia.org is wikipedia.
host_extract(domains)
domains | a vector of domain names
a data.frame of two columns: domain, with the original domain names, and host, the identified host from the domain.
# With subdomains
has_subdomain <- domain("https://en.wikipedia.org/wiki/Main_Page")
host_extract(has_subdomain)

# Without
no_subdomain <- domain("https://ironholds.org/projects/r_shiny/")
host_extract(no_subdomain)
URLs can have parameters, taking the form of name=value, chained together with & symbols. param_get, when provided with a vector of URLs and a vector of parameter names, will generate a data.frame consisting of the values of each parameter for each URL.
param_get(urls, parameter_names = NULL)
urls | a vector of URLs
parameter_names | a vector of parameter names. If NULL (the default), all parameters will be extracted.
a data.frame containing one column for each provided parameter name. Values that cannot be found within a particular URL are represented by an NA.
url_parse for decomposing URLs into their constituent parts and param_set for inserting or modifying key/value pairs within a query string.
#A very simple example
url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
parameter_values <- param_get(url, c("this_parameter","hiphop"))
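A supplementary sketch of the vectorised case (the URLs are invented): one row is returned per URL, and a requested key that is absent from a given URL should appear as NA, as described above.

library(urltools)
urls <- c("https://example.com/search?q=urltools&lang=en",
          "https://example.com/search?q=vectorisation")
# Two rows, one per URL; "lang" is missing from the second URL,
# so that cell should be NA.
param_get(urls, c("q", "lang"))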
URLs often have queries associated with them, particularly URLs for APIs, that look like ?key=value&key=value&key=value. param_remove allows you to remove key/value pairs while leaving the rest of the URL intact.
param_remove(urls, keys)
urls | a vector of URLs. These should be decoded with url_decode if they are encoded.
keys | a vector of parameter keys to remove.
the original URLs, but with the key/value pairs specified by keys removed. If the original URL is NA, NA will be returned; if a specified key is NA, nothing will be done with it.
param_set to modify values associated with keys, or param_get to retrieve those values.
# Remove multiple parameters from a URL
param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
             keys = c("action","format"))
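A small supplementary sketch of the NA handling described above (the first URL is invented): an NA input URL should come back as NA, while matching keys are stripped from the other entries.

library(urltools)
param_remove(urls = c("https://example.com/api?action=list&format=json", NA),
             keys = "format")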
URLs often have queries associated with them, particularly URLs for APIs, that look like ?key=value&key=value&key=value. param_set allows you to modify key/value pairs within query strings, or even add new ones if they don't exist within the URL.
param_set(urls, key, value)
urls | a vector of URLs. These should be decoded (with url_decode) if they are encoded.
key | a string representing the key to modify the value of (or insert wholesale if it doesn't exist within the URL).
value | a value to associate with the key. This can be a single string, or a vector the same length as urls.
the original vector of URLs, but with modified/inserted key-value pairs. If the URL is NA, the returned value will be NA; if the key or value are NA, no insertion will be made.
param_get to retrieve the values associated with multiple keys in a vector of URLs, and param_remove to strip key/value pairs from a URL entirely.
# Set a URL parameter where there's already a key for that
param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")

# Set a URL parameter where there isn't.
param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
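A supplementary sketch (not from the original examples) of the vectorised form noted in the arguments: value may be a vector the same length as urls, assigning a different value to each URL. The URLs are invented.

library(urltools)
urls <- c("https://example.com/api.php?action=query",
          "https://example.com/api.php?action=query")
# One value per URL; each query string gets its own "page" parameter.
param_set(urls, key = "page", value = c("1", "2"))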
As in the lubridate package, individual components of a URL can be extracted or set using the relevant function call; see the examples.
parameters(x)

parameters(x) <- value
x | a URL, or vector of URLs
value | a replacement value (or vector of replacement values) for x's parameters. If NULL, the parameters will be removed entirely.
scheme, domain, port, path and fragment for other accessors.
# Get the parameters
example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
parameters(example_url)

# Set the parameters
parameters(example_url) <- "debug=false"

# Remove the parameters
parameters(example_url) <- NULL
As in the lubridate package, individual components of a URL can be extracted or set using the relevant function call; see the examples.
path(x)

path(x) <- value
x | a URL, or vector of URLs
value | a replacement value (or vector of replacement values) for x's path. If NULL, the path will be removed entirely.
scheme, domain, port, parameters and fragment for other accessors.
# Get the path
example_url <- "http://cran.r-project.org:80/submit.html"
path(example_url)

# Set the path
path(example_url) <- "bin/windows/"

# Remove the path
path(example_url) <- NULL
As in the lubridate package, individual components of a URL can be extracted or set using the relevant function call; see the examples.
port(x)

port(x) <- value
x | a URL, or vector of URLs
value | a replacement value (or vector of replacement values) for x's port. If NULL, the port will be entirely removed.
scheme, domain, path, parameters and fragment for other accessors.
# Get the port
example_url <- "http://cran.r-project.org:80/submit.html"
port(example_url)

# Set the port
port(example_url) <- "12"

# Remove the port
port(example_url) <- NULL
puny_encode and puny_decode implement the encoding standard for internationalised (non-ASCII) domains and subdomains. You can use them to encode UTF-8 domain names, or decode encoded names (which start with "xn--"), or both.
puny_encode(x)

puny_decode(x)
x | a vector of URLs. These should be URL decoded using url_decode.
a CharacterVector containing encoded or decoded versions of the entries in x. Invalid URLs (ones that are NA, or ones that do not successfully map to an actual decoded or encoded version) will be returned as NA.
url_decode and url_encode for percent-encoding.
# Encode a URL
puny_encode("https://www.bücher.com/foo")

# Decode the result, back to the original
puny_decode("https://www.xn--bcher-kva.com/foo")
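A brief supplementary sketch of the NA behaviour described above: NA (or otherwise unmappable) entries should be returned as NA alongside successfully encoded ones.

library(urltools)
puny_encode(c("https://www.bücher.com/foo", NA))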
As in the lubridate package, individual components of a URL can be extracted or set using the relevant function call; see the examples.
scheme(x)

scheme(x) <- value
x | a URL, or vector of URLs
value | a replacement value (or vector of replacement values) for x's scheme.
domain, port, path, parameters and fragment for other accessors.
#Get a component
example_url <- "http://cran.r-project.org/submit.html"
scheme(example_url)

#Set a component
scheme(example_url) <- "https"

# NA out the URL
scheme(example_url) <- NA_character_
Authentication credentials appear before the domain name and look like user:password. Sometimes you want them removed, or retrieved; strip_credentials and get_credentials do precisely that.
strip_credentials(urls)

get_credentials(urls)
urls | a URL, or vector of URLs
# Remove credentials
strip_credentials("http://foo:[email protected]:3128")

# Get credentials
get_credentials("http://foo:[email protected]:3128")
This dataset contains a registry of public suffixes, as retrieved from, and defined by, the public suffix list. It is sorted by how many periods (".") appear in the suffix, to optimise it for suffix_extract. It is a data.frame with two columns: the first is the list of suffixes, and the second is our best guess at the comment or owner associated with the particular suffix.
data(suffix_dataset)
A data.frame of 8030 rows and 2 columns
Last updated 2016-07-31.
suffix_extract for extracting suffixes from domain names, and suffix_refresh for getting a new, totally-up-to-date dataset version.
Domain names have suffixes - common endings that people can or could register domains under. This includes things like ".org", but also things like ".edu.co". A simple Top Level Domain list, as a result, probably won't cut it.
suffix_extract takes the list of public suffixes, as maintained by Mozilla (see suffix_dataset), and a vector of domain names, and produces a data.frame containing the suffix that each domain uses, and the remaining fragment.
suffix_extract(domains, suffixes = NULL)
domains | a vector of domains, from domain or url_parse
suffixes | a dataset of suffixes. By default, this is NULL and the function relies on the inbuilt suffix_dataset.
a data.frame of four columns: "host", "subdomain", "domain" and "suffix". "host" is what was passed in. "subdomain" is the subdomain of the suffix. "domain" contains the part of the domain name that came before the matched suffix. "suffix" is, well, the suffix.
suffix_dataset for the dataset of suffixes.
# Using url_parse
domain_name <- url_parse("http://en.wikipedia.org")$domain
suffix_extract(domain_name)

# Using domain()
domain_name <- domain("http://en.wikipedia.org")
suffix_extract(domain_name)

## Not run:
#Relying on a fresh version of the suffix dataset
suffix_extract(domain("http://en.wikipedia.org"), suffix_refresh())

## End(Not run)
urltools comes with an inbuilt dataset of public suffixes, suffix_dataset. This is used in suffix_extract to identify the top-level domain within a particular domain name. While updates to the dataset will be included in each new package release, there's going to be a gap between changes to the suffixes list and changes to the package. Accordingly, the package also includes suffix_refresh, which generates and returns a fresh version of the dataset. This can then be passed through to suffix_extract.
suffix_refresh()
a dataset equivalent in format to suffix_dataset.
suffix_extract to extract suffixes from domain names, or suffix_dataset for the inbuilt, default version of the data.
## Not run:
new_suffixes <- suffix_refresh()

## End(Not run)
This dataset contains a registry of top-level domains, as retrieved from, and defined by, the IANA.
data(tld_dataset)
A vector of 1275 elements.
Last updated 2016-07-20.
tld_extract for extracting TLDs from domain names, and tld_refresh to get an updated version of this dataset.
tld_extract extracts the top-level domain (TLD) from a vector of domain names. This is distinct from the suffixes, extracted with suffix_extract; TLDs are top level, while suffixes are just domains through which internet users can publicly register domains (the difference between .org.uk and .uk).
tld_extract(domains, tlds = NULL)
domains | a vector of domain names
tlds | a dataset of TLDs. If NULL (the default), the function relies on the inbuilt tld_dataset.
a data.frame of two columns: domain, with the original domain names, and tld, the identified TLD from the domain.
suffix_extract for retrieving suffixes (distinct from TLDs).
# Using the inbuilt dataset
domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
tld_extract(domains)

# Using a refreshed one
tld_extract(domains, tld_refresh())
urltools comes with an inbuilt dataset of top level domains (TLDs), tld_dataset. This is used in tld_extract to identify the top-level domain within a particular domain name. While updates to the dataset will be included in each new package release, there's going to be a gap between changes to TLDs and changes to the package. Accordingly, the package also includes tld_refresh, which generates and returns a fresh version of the dataset. This can then be passed through to tld_extract.
tld_refresh()
a dataset equivalent in format to tld_dataset.
tld_extract to extract TLDs from domain names, or tld_dataset for the inbuilt, default version of the data.
## Not run:
new_tlds <- tld_refresh()

## End(Not run)
Sometimes you want to take a vector of URLs, parse them, perform some operations and then rebuild them. url_compose takes a data.frame produced by url_parse and rebuilds it into a vector of full URLs (or: URLs as full as the vector initially thrown into url_parse). This is currently a 'beta' feature; please do report bugs if you find them.
url_compose(parsed_urls)
parsed_urls | a data.frame sourced from url_parse
scheme and other accessors, which you may want to run URLs through before composing them to modify individual values.
#Parse a URL and compose it
url <- "http://en.wikipedia.org"
url_compose(url_parse(url))
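A supplementary sketch (not from the original examples) of the parse-modify-compose round trip suggested above; here a column of the parsed data.frame is edited directly before recomposing. The commented expectation is an assumption rather than a guarantee of exact output.

library(urltools)
parsed <- url_parse("http://en.wikipedia.org/wiki/Article")
parsed$scheme <- "https"    # edit a component of the parsed data.frame
url_compose(parsed)         # should rebuild an https:// version of the URL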
Encodes or decodes a URI/URL.
url_decode(urls)

url_encode(urls)
urls | a vector of URLs to decode or encode.
URL encoding and decoding is an essential prerequisite to proper web interaction and data analysis around things like server-side logs. The relevant IETF RfC mandates the percentage-encoding of non-Latin characters, including things like slashes, unless those are reserved.
Base R provides URLdecode and URLencode, which handle URL encoding - in theory. In practice, they have a set of substantial problems that the urltools implementation solves:
No vectorisation: Both base R functions operate on single URLs, not vectors of URLs. This means that, when confronted with a vector of URLs that need encoding or decoding, your only option is to loop from within R. This can be incredibly computationally costly with large datasets. url_encode and url_decode are implemented in C++ and entirely vectorised, allowing for a substantial performance improvement.
No scheme recognition: encoding the slashes in, say, http://, is a good way of making sure your URL no longer works. Because of this, the only thing you can encode in URLencode (unless you refuse to encode reserved characters) is a partial URL, lacking the initial scheme, which requires additional operations to set up and increases the complexity of encoding or decoding. url_encode detects the protocol and silently splits it off, leaving it unencoded to ensure that the resulting URL is valid.
ASCII NULs: Server side data can get very messy and sometimes include out-of-range characters. Unfortunately, URLdecode's response to these characters is to convert them to NULs, which R can't handle, at which point your URLdecode call breaks. url_decode simply ignores them.
a character vector containing the encoded (or decoded) versions of "urls".
puny_decode and puny_encode, for punycode decoding and encoding.
url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_%28logo%29.jpg")
url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")

## Not run:
#A demonstrator of the contrasting behaviours around out-of-range characters
URLdecode("%gIL")
url_decode("%gIL")

## End(Not run)
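A supplementary sketch of the vectorisation point made above (the URLs are invented): both functions accept whole character vectors, so large columns of URLs need no explicit loop.

library(urltools)
urls <- c("https://example.com/a path/with spaces",
          "https://example.com/file (copy).txt")
encoded <- url_encode(urls)    # the scheme is detected and left unencoded
encoded
url_decode(encoded)            # should round-trip back to the originals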
url_parse takes a vector of URLs and splits each one into its component parts, as recognised by RfC 3986.
url_parse(urls)
urls | a vector of URLs
It's useful to be able to take a URL and split it out into its component parts - for the purpose of hostname extraction, for example, or analysing API calls. This functionality is not provided in base R, although it is provided in parse_url in the httr package; that implementation is entirely in R, uses regular expressions, and is not vectorised. It's perfectly suitable for the intended purpose (decomposition in the context of automated HTTP requests from R), but not for large-scale analysis.
Note that user authentication/identification information is not extracted; this can be found with get_credentials.
a data.frame consisting of the columns scheme, domain, port, path, query and fragment. See the relevant IETF RfC for definitions. If an element cannot be identified, it is represented by an empty string.
param_get for extracting values associated with particular keys in a URL's query string, and url_compose, which is url_parse in reverse.
url_parse("https://en.wikipedia.org/wiki/Article")
This package provides functions for URL encoding and decoding, parsing, and parameter extraction, designed to be both fast and entirely vectorised. It is intended to be useful for people dealing with web-related datasets, such as server-side logs.
For a fuller walkthrough, see the package vignette.
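A short, hedged end-to-end sketch tying the functions above together; the URLs are invented and the results are illustrative rather than guaranteed output.

library(urltools)
log_urls <- c("https://en.wikipedia.org/w/api.php?action=query&format=json",
              "https://cran.r-project.org/submit.html")
parsed <- url_parse(log_urls)                 # one row per URL
parsed$domain                                 # vectorised component access
param_get(log_urls, c("action", "format"))    # NA where a key is absent
suffix_extract(domain(log_urls))              # split each domain around its public suffix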