Using RSelenium to scrape a paginated HTML table

Feed: R-bloggers.
Author: Guillaume Pressiat.

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers].

While trying to answer this question on Stack Overflow about scraping understat.com,
I was curious to take RSelenium for a spin.

A few years ago, Selenium and R weren’t particularly good friends (Python + Selenium was the more common pairing, for instance), but
that seems to have changed.
The work and documentation of the package author and rOpenSci made the difference.

After a few tries with XPath spellings, I actually found RSelenium pretty nice.
I share here some recipes for this context: when you want to scrape a paginated table that is
not plain HTML but the result of embedded JavaScript executed in the browser.

One thing that wasn’t particularly easy in Selenium at the beginning was how to extract sub-elements, such as the HTML code of a table, rather than the “source page as a whole”.
I have used the innerHTML attribute for this.
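
As a minimal sketch of that idea: in RSelenium, `getElementAttribute("innerHTML")` returns the markup of a single element as a string inside a list, and that string can be handed to rvest directly. The fragment below is a made-up stand-in for what the browser would return, so the example runs without a Selenium server; the team names and numbers are illustrative only.

```r
library(rvest)

# Hypothetical stand-in for elem$getElementAttribute("innerHTML")[[1]]:
# the markup found inside one element, not the whole page source.
inner_html <- '
<table>
  <thead><tr><th>Team</th><th>M</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Team A</td><td>30</td><td>71</td></tr>
    <tr><td>Team B</td><td>30</td><td>60</td></tr>
  </tbody>
</table>'

# read_html() accepts an HTML fragment given as a string;
# html_table() returns one data frame per <table> found in it
champ <- read_html(inner_html) %>%
  html_table() %>%
  .[[1]]

print(champ)
```

With a live session, `inner_html` would simply be replaced by the string returned for the element of interest.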

This example explains how to emulate clicks to navigate from one element to another in the HTML page, with a particular focus on moving from page to page in a paginated table.
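
A click on a pagination item could be sketched as follows. This is only an illustration under assumptions: the helper names `page_xpath()` and `click_page()` are hypothetical, the id `league-players` is assumed for the paginated table, and the XPath supposes that pagination items are `li` elements carrying a `data-page` attribute. The exact path should be checked with the browser inspector.

```r
# Build the XPath of the pagination item for page `i` inside the element `table_id`.
# Assumption: pagination items look like <li class="page" data-page="...">.
page_xpath <- function(table_id, i) {
  sprintf("//*[@id='%s']//li[@data-page='%d']", table_id, as.integer(i))
}

# Click on that item in the emulated browser.
# `remDr` must be an open RSelenium remoteDriver;
# nothing is sent to the browser until this function is called.
click_page <- function(remDr, table_id, i) {
  elem <- remDr$findElement(using = "xpath", value = page_xpath(table_id, i))
  elem$clickElement()
  invisible(elem)
}

# The XPath alone can be inspected without a browser
print(page_xpath("league-players", 3))
```

Keeping the XPath construction in its own function makes it easy to test the selector in the browser console before wiring it to `clickElement()`.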

Here is a YouTube video I made to illustrate it (with subtitles, no voice).

  • The first step is to download a selenium-server-standalone-xxx.jar file here,
    see this vignette.
  • then run it in the terminal: java -jar selenium-server-standalone-xxx.jar
  • then you can inspect the elements of the HTML page precisely in the browser (right click, inspect element) and go back and forth between RStudio and the emulated browser
  • at the end, use rvest to parse the HTML tables

For instance, find an id such as league-chemp, which we then use with RSelenium:

[Image: capture html]

elem_chemp <- remDr$findElement(using = "xpath", value = "//*[@id='league-chemp']")

The full script is available as a gist on GitHub and is reproduced below.

# https://stackoverflow.com/q/67021563/10527496


# java -jar selenium-server-standalone-3.9.1.jar 


library(RSelenium)
library(tidyverse)
library(rvest)
library(httr)

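# connect to the Selenium server started above
# (reconstructed; host, port and browser are assumptions)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()

# open the page to scrape (reconstructed; the league URL is an assumption)
remDr$navigate("https://understat.com/league/EPL")

# find championship table in html via xpath (reconstructed)
elem_chemp <- remDr$findElement(using = "xpath", value = "//*[@id='league-chemp']")

# scrape the html table as a tibble (read_html() call reconstructed)
results_champ <- read_html(elem_chemp$getElementAttribute("innerHTML")[[1]]) %>% 
  html_table() %>% .[[1]] %>% 
  slice(-1)


# find player table in html via xpath
# (reconstructed; the id 'league-players' is an assumption)
elem_player_page_number <- remDr$findElement(using = "xpath", value = "//*[@id='league-players']")

# number of pages, read from the pagination items
# (reconstructed; the variable name is an assumption)
player_page_number <- read_html(elem_player_page_number$getElementAttribute("innerHTML")[[1]]) %>% 
  html_nodes('li.page') %>% 
  html_attr('data-page') %>% 
  as.integer() %>% 
  max()


# move to this table via script
remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(elem_player_page_number))

# or scroll at the bottom of page (reconstructed)
# body_b <- remDr$findElement("css", "body")
# body_b$sendKeysToElement(list(key = "end"))

# scrape page i of the player table (function header and click reconstructed)
one_table_at_a_time <- function(i){
  
  # click on the pagination item for page i (this xpath is an assumption)
  elem_click <- remDr$findElement(using = "xpath", 
    value = paste0("//*[@id='league-players']//li[@data-page='", i, "']"))
  elem_click$clickElement()
  
  # read the table currently displayed
  elem_player <- remDr$findElement(using = "xpath", value = "//*[@id='league-players']")
  results_player <- read_html(elem_player$getElementAttribute("innerHTML")[[1]]) %>% 
    html_table()
  
  message('Player table scraped, page ', i)
  # keep the first table and drop rows without appearances
  results_player %>% 
    .[[1]] %>% 
    filter(!is.na(Apps))
  
}

# one_table_at_a_time(3) %>% View
# loop over pages
resu <- seq_len(player_page_number) %>% purrr::map_df(one_table_at_a_time)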
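
The page count and the loop over pages can be tried without a browser. The sketch below is a minimal illustration under stated assumptions: the pagination markup is invented to match the `li.page` / `data-page` selectors used above, `fake_page()` is a hypothetical stand-in for `one_table_at_a_time()`, and base R's `lapply()` and `rbind()` stand in for `purrr::map_df()` only to keep dependencies small.

```r
library(rvest)

# Hypothetical pagination markup, shaped to match the selectors used above
pagination_html <- '
<ul class="pagination">
  <li class="page" data-page="1"><a>1</a></li>
  <li class="page" data-page="2"><a>2</a></li>
  <li class="page" data-page="3"><a>3</a></li>
</ul>'

# Highest data-page value found among the pagination items
n_pages <- read_html(pagination_html) %>%
  html_nodes("li.page") %>%
  html_attr("data-page") %>%
  as.integer() %>%
  max()

# Stand-in for one_table_at_a_time(): returns a one-row data frame per page
fake_page <- function(i) {
  data.frame(page = i, player = paste0("player_", i), stringsAsFactors = FALSE)
}

# Scrape every page in turn and row-bind the results
all_pages <- do.call(rbind, lapply(seq_len(n_pages), fake_page))

print(all_pages)
```

In a real run each call would click and read one page of the table, so the pages are visited sequentially in the same browser session.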

