Title: | Parallel Dynamic Web-Scraping Using 'RSelenium' |
---|---|
Description: | A system to increase the efficiency of dynamic web-scraping with 'RSelenium' by leveraging parallel processing. You provide a function wrapper for your 'RSelenium' scraping routine with a set of inputs, and 'parsel' runs it in several browser instances. Chunked input processing as well as error catching and logging ensures seamless execution and minimal data loss, even when unforeseen 'RSelenium' errors occur. You can additionally build safe scraping functions with minimal coding by utilizing constructor functions that act as wrappers around 'RSelenium' methods. |
Authors: | Till Tietz [cre, aut] |
Maintainer: | Till Tietz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0 |
Built: | 2024-11-10 05:15:18 UTC |
Source: | https://github.com/till-tietz/parsel |
pipe-like operator that passes the output of lhs to the prev argument of rhs to paste together a scraper function in sequence.
lhs %>>% rhs
lhs | a parsel constructor function call |
rhs | a parsel constructor function call that should accept lhs as its prev argument |
the output of rhs evaluated with lhs as the prev argument
## Not run:
#paste together the go and goback output in sequence
go("https://www.wikipedia.org/") %>>%
  goback()
## End(Not run)
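To make the mechanics concrete, here is a minimal base-R sketch of how a pipe-like operator could inject its left-hand side as the prev argument of the right-hand call. It is an illustration only and not parsel's actual implementation; the operator name `%then%` and the toy constructors `step_a()` and `step_b()` are hypothetical.

```r
# Hypothetical pipe-like operator: evaluates rhs with lhs supplied as `prev`.
`%then%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)   # capture the unevaluated rhs call
  rhs_call$prev <- lhs          # inject the lhs value as the `prev` argument
  eval(rhs_call, envir = parent.frame())
}

# Toy constructors (hypothetical) that append one line of text to `prev`.
step_a <- function(prev = NULL) paste(c(prev, "# step a"), collapse = "\n")
step_b <- function(prev = NULL) paste(c(prev, "# step b"), collapse = "\n")

out <- step_a() %then% step_b()
cat(out, "\n")
```

Because each constructor returns text and accepts the previous text through prev, chaining calls this way accumulates the pieces in order.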
generates the scraping function defined by start_scraper and other constructors in your environment
build_scraper(prev = NULL)
prev | a placeholder for the output of functions being piped into build_scraper(). Defaults to NULL and should not be altered. |
a function
## Not run:
start_scraper(args = c("x"), name = "fun") %>>%
  go("x") %>>%
  build_scraper()
## End(Not run)
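One general way to turn a function definition held as text into a real function is to parse and evaluate the string. The sketch below shows that idea in base R. It is an assumption about the general technique, not a description of what build_scraper() does internally, and `fun_text` is a hypothetical stand-in for the text produced by a constructor pipeline.

```r
# Hypothetical text of a function definition; the body is a placeholder,
# not real parsel output.
fun_text <- paste(
  "fun <- function(x){",
  "  paste('would navigate to', x)",
  "}",
  sep = "\n"
)

# Parse the text and evaluate it so that `fun` exists in this environment.
eval(parse(text = fun_text))

fun("https://www.wikipedia.org/")
```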
wrapper around clickElement() method to generate safe scraping code
click(using, value, name = NULL, new_page = FALSE, prev = NULL)
using | character string specifying locator scheme to use to search elements. Available schemes: "class name", "css selector", "id", "name", "link text", "partial link text", "tag name", "xpath". |
value | character string specifying the search target. |
name | character string specifying the object name the RSelenium "wElement" class object should be saved to. |
new_page | logical indicating if clickElement() action will result in a change in url. |
prev | a placeholder for the output of functions being piped into click(). Defaults to NULL and should not be altered. |
a character string defining 'RSelenium' clicking instructions that can be pasted into a scraping function.
## Not run:
#navigate to wikipedia, click random article
parsel::go("https://www.wikipedia.org/") %>>%
  parsel::click(using = "id", value = "'n-randompage'") %>>%
  show()
## End(Not run)
wrapper around getElementText() method to generate safe scraping code
get_element(using, value, name = NULL, multiple = FALSE, prev = NULL)
using | character string specifying locator scheme to use to search elements. Available schemes: "class name", "css selector", "id", "name", "link text", "partial link text", "tag name", "xpath". |
value | character string specifying the search target. |
name | character string specifying the object name the RSelenium "wElement" class object should be saved to. If NULL a name will be generated automatically. |
multiple | logical indicating whether multiple elements should be returned. If TRUE the findElements() method will be invoked. |
prev | a placeholder for the output of functions being piped into get_element(). Defaults to NULL and should not be altered. |
a character string defining 'RSelenium' getElementText() instructions that can be pasted into a scraping function.
## Not run:
#navigate to wikipedia, type "Hello" into the search box,
#press enter, get page header
parsel::go("https://www.wikipedia.org/") %>>%
  parsel::type(using = "id",
               value = "'searchInput'",
               name = "searchbox",
               text = c("Hello","\uE007")) %>>%
  parsel::get_element(using = "id",
                      value = "'firstHeading'",
                      name = "header") %>>%
  show()

#navigate to wikipedia, type "Hello" into the search box, press enter,
#get page header, save in external data.frame x.
parsel::go("https://www.wikipedia.org/") %>>%
  parsel::type(using = "id",
               value = "'searchInput'",
               name = "searchbox",
               text = c("Hello","\uE007")) %>>%
  parsel::get_element(using = "id",
                      value = "'firstHeading'",
                      name = "x[,1]") %>>%
  show()
## End(Not run)
wrapper around remDr$navigate method to generate safe navigation code
go(url, prev = NULL)
url | a character string specifying the name of the object holding the url string or the url string the function should navigate to. |
prev | a placeholder for the output of functions being piped into go(). Defaults to NULL and should not be altered. |
a character string defining 'RSelenium' navigation instructions that can be pasted into a scraping function
## Not run:
go("https://www.wikipedia.org/") %>>%
  show()
## End(Not run)
wrapper around remDr$goBack method to generate safe backwards navigation code
goback(prev = NULL)
prev | a placeholder for the output of functions being piped into goback(). Defaults to NULL and should not be altered. |
a character string defining 'RSelenium' backwards navigation instructions that can be pasted into a scraping function
## Not run:
goback() %>>%
  show()
## End(Not run)
wrapper around remDr$goForward method to generate safe forwards navigation code
goforward(prev = NULL)
prev | a placeholder for the output of functions being piped into goforward(). Defaults to NULL and should not be altered. |
a character string defining 'RSelenium' forward navigation instructions that can be pasted into a scraping function.
## Not run:
goforward() %>>%
  show()
## End(Not run)
parallelize execution of RSelenium
parscrape(
  scrape_fun,
  scrape_input,
  cores = NULL,
  packages = c("base"),
  browser,
  ports = NULL,
  chunk_size = NULL,
  scrape_tries = 1,
  proxy = NULL,
  extraCapabilities = list()
)
scrape_fun | the scraping function to be parallelized: a function with input x that sends instructions to remDr (the remote driver) |
scrape_input | a data frame, list, or vector where each element is an input to be passed to scrape_fun |
cores | number of cores to run RSelenium instances on. Defaults to available cores - 1. |
packages | a character vector with the names of the packages used in scrape_fun |
browser | a character vector specifying the browser to be used |
ports | vector of ports for RSelenium instances. If left at the default NULL, parscrape will randomly generate ports. |
chunk_size | number of scrape_input elements to be processed per round of scrape_fun. parscrape splits scrape_input into chunks and runs scrape_fun in multiple rounds to avoid losing data due to errors. Defaults to number of cores. |
scrape_tries | number of times parscrape will re-try to scrape a chunk when encountering an error |
proxy | a proxy setting function that runs before scraping each chunk |
extraCapabilities | a list of extraCapabilities options to be passed to rsDriver |
a list containing the elements: scraped_results and not_scraped. scraped_results is a list containing the output of scrape_fun. If there are no unscraped input elements then not_scraped is NULL. If there are unscraped elements not_scraped is a data.frame containing the scrape_input id, chunk id and associated error of all unscraped input elements.
## Not run:
input <- c(".central-textlogo__image",".central-textlogo__image")

scrape_fun <- function(x){
  input_i <- x
  remDr$navigate("https://www.wikipedia.org/")
  element <- remDr$findElement(using = "css", input_i)
  element <- element$getElementText()
  return(element)
}

parsel_out <- parscrape(scrape_fun = scrape_fun,
                        scrape_input = input,
                        cores = 2,
                        packages = c("RSelenium"),
                        browser = "firefox",
                        scrape_tries = 1,
                        chunk_size = 2,
                        extraCapabilities = list(
                          "moz:firefoxOptions" = list(args = list('--headless'))
                        )
)
## End(Not run)
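The chunking and error-logging behaviour described for chunk_size, scrape_tries, and the returned list can be illustrated without a browser. The following is a minimal base-R sketch under stated assumptions: `run_chunked()`, its `retries` argument, and `toy_fun()` are hypothetical, the loop runs sequentially rather than in parallel, and none of this is taken from parscrape's source.

```r
# Hypothetical helper: run `fun` over `input` in chunks, retry failed chunks,
# and record the inputs that could not be processed. Sequential only;
# parallel dispatch and browser handling are deliberately left out.
run_chunked <- function(fun, input, chunk_size = 2, retries = 1) {
  ids <- seq_along(input)
  chunks <- split(ids, ceiling(ids / chunk_size))
  results <- list()
  failed <- NULL
  for (chunk_id in seq_along(chunks)) {
    idx <- chunks[[chunk_id]]
    out <- NULL
    # one initial attempt plus `retries` further attempts
    for (attempt in seq_len(retries + 1)) {
      out <- tryCatch(lapply(input[idx], fun), error = function(e) e)
      if (!inherits(out, "error")) break
    }
    if (inherits(out, "error")) {
      # log every input id of the failed chunk together with the error
      failed <- rbind(failed, data.frame(
        id = idx, chunk = chunk_id, error = conditionMessage(out)
      ))
    } else {
      results[idx] <- out
    }
  }
  list(scraped_results = results, not_scraped = failed)
}

# Toy "scraper" that fails on one input.
toy_fun <- function(x) if (x == "bad") stop("element not found") else toupper(x)

res <- run_chunked(toy_fun, c("a", "b", "bad", "c"), chunk_size = 2)
res$not_scraped
```

In this sketch a failing input takes its whole chunk with it, which is why smaller chunks limit how much is lost per error at the cost of more rounds.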
renders the output of the piped functions to the console via cat()
show(prev = NULL)
prev | a placeholder for the output of functions being piped into show(). Defaults to NULL and should not be altered. |
None (invisible NULL)
## Not run:
go("https://www.wikipedia.org/") %>>%
  goback() %>>%
  show()
## End(Not run)
sets function name and arguments of scraping function
start_scraper(args, name = NULL)
args | a character vector of function arguments |
name | character string specifying the object name of the scraping function. If NULL defaults to 'scraper' |
a character string starting a function definition
## Not run:
start_scraper(args = c("x","y"), name = "fun")
## End(Not run)
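As a rough picture of what "a character string starting a function definition" can look like, the sketch below builds such a string from a name and a vector of argument names. The helper `make_header()` and the exact text layout are hypothetical; the real output of start_scraper() may be formatted differently.

```r
# Hypothetical helper: build the opening line of a function definition
# from an object name and a character vector of argument names.
make_header <- function(args, name = NULL) {
  if (is.null(name)) name <- "scraper"   # default name stated in the docs above
  paste0(name, " <- function(", paste(args, collapse = ", "), "){")
}

header <- make_header(args = c("x", "y"), name = "fun")
cat(header, "\n")
```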
wrapper around sendKeysToElement() method to generate safe scraping code
type(
  using,
  value,
  name = NULL,
  text,
  text_object,
  new_page = FALSE,
  prev = NULL
)
using | character string specifying locator scheme to use to search elements. Available schemes: "class name", "css selector", "id", "name", "link text", "partial link text", "tag name", "xpath". |
value | character string specifying the search target. |
name | character string specifying the object name the RSelenium "wElement" class object should be saved to. If NULL a name will be generated automatically. |
text | a character vector specifying the text to be typed. |
text_object | a character string specifying the name of an external object holding the text to be typed. Note that the remDr$sendKeysToElement method only accepts list inputs. |
new_page | logical indicating if sendKeysToElement() action will result in a change in url. |
prev | a placeholder for the output of functions being piped into type(). Defaults to NULL and should not be altered. |
a character string defining 'RSelenium' typing instructions that can be pasted into a scraping function.
## Not run:
#navigate to wikipedia, type "Hello" into the search box, press enter
parsel::go("https://www.wikipedia.org/") %>>%
  parsel::type(using = "id",
               value = "'searchInput'",
               name = "searchbox",
               text = c("Hello","\uE007")) %>>%
  show()

#navigate to wikipedia, type content stored in external object "x" into search box
parsel::go("https://www.wikipedia.org/") %>>%
  parsel::type(using = "id",
               value = "'searchInput'",
               name = "searchbox",
               text_object = "x") %>>%
  show()
## End(Not run)
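Because text_object refers to an object by name and sendKeysToElement() only accepts list inputs, the external object should already be a list when the scraper runs. A minimal sketch, assuming the object is called `x` as in the example above; its contents and the intermediate name `keys` are illustrative only.

```r
# Text to type followed by the Enter key; "\uE007" is the key code for Enter
# used in the examples above.
keys <- c("Hello", "\uE007")

# sendKeysToElement() takes a list, so convert the character vector.
x <- as.list(keys)

str(x)
```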