Webscraping in R with Rvest

How-to
Project
NLP
R
A tutorial on the Rvest package in R
Author

Mark Edney

Published

June 22, 2022

Web scraping has become an important tool in data science because it offers an easy way to generate new data. Its main advantage is the automation of some pretty repetitive tasks. Web scraping can also be a good way of keeping up with new data on a website, assuming the site doesn’t make big changes to its HTML structure.

Introduction

This project is inspired by a YouTube video created by Thu Vu about a data scraping project on the Witcher book series. Her project uses Python and Selenium. I love the book series and I loved the project idea. I’ve also had learning the rvest library on my backlog for a while, so this seemed like a great opportunity to combine these two interests.

Rather than completing the project on the Witcher series, I thought it would be interesting to explore another book series that I love: the Stormlight Archive by Brandon Sanderson. If you are not familiar with the series, it is an epic fantasy story spanning four main books at the time of publishing this post. Sanderson is a fantastic author and I feel that the Stormlight Archive is his best work.

For this project, I will scrape the Coppermind website for all the character names in the series. The Coppermind is a fan-made wiki that covers the work of Brandon Sanderson. After retrieving all the character names, I will create a graph outlining the relationships between the characters. That work will be done in a future post, so please look forward to it.

Initialization

The first step is to install the rvest package. This is done with the following code.

install.packages("rvest")

The rvest package is also installed when you install the tidyverse package. Loading the tidyverse package, however, does not load the rvest package, so the two need to be loaded separately.

library("tidyverse")
library("rvest")

Data Scraping

To start the data scraping exercise, we use the read_html function to read the HTML of the page we would like to scrape and save the result. This is the URL of the character page for the Stormlight Archive series.

site <- read_html("https://coppermind.net/wiki/Category:Rosharans")

While on the website in your own browser, right-click on the specific element you’re interested in scraping and select Inspect. This is at least the method used in Firefox, but it should be similar in other browsers.

From here, you have to do a little digging and a little experimentation to determine which HTML elements are important for the character list. It is pretty useful to have a solid understanding of HTML at this point. From my experimentation, I found that the list was contained within a div with the class “mw-category-group”. A div is a generic container tag in HTML and can represent many things. I selected the elements with the following code:

names <- site %>% 
        html_elements("div.mw-category-group")

You use the html_elements function to select all the elements that match the CSS selector you pass, here the HTML tag “div”. The addition of “.mw-category-group” narrows the selection to only the divs with that specific class. The class is an attribute of HTML tags, used to identify and group HTML elements together. I have found that this notation is the best way to filter elements.
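
To see how the class suffix narrows a selection without hitting the live site, here is a minimal sketch on a made-up HTML fragment. The fragment and its contents are assumptions for illustration only, not the actual Coppermind markup; minimal_html is rvest's helper for building a small document from a string.

```r
library(rvest)

# Made-up fragment (an assumption, not the real page):
# two divs carry the class, one does not.
page <- minimal_html('
  <div class="mw-category-group"><ul><li><a title="A">A</a></li></ul></div>
  <div class="mw-category-group"><ul><li><a title="B">B</a></li></ul></div>
  <div class="other"><ul><li><a title="C">C</a></li></ul></div>
')

all_divs <- html_elements(page, "div")                    # every div in the fragment
groups   <- html_elements(page, "div.mw-category-group")  # only divs with the class

length(all_divs)  # 3
length(groups)    # 2
```

Experimenting on a small fragment like this can be a quick way to check a selector before running it against the full page.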

Within the div elements, there is a further sub-structure for each entry in the character list. Each character is contained within an <li> tag as a list item, and within that list item as an <a> tag, which is a hyperlink. We can go further into the HTML structure by selecting these elements. After the final structure is selected, we can use the html_attr function to return an attribute of the selected elements. The ‘title’ attribute stores the character name in the HTML. We could also use the html_text2 function to return the text of the hyperlink, but I’ve found that the ‘title’ attribute is better structured.

names <- names %>%
        html_elements("li") %>%
        html_elements("a") %>%
        html_attr("title")
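
The difference between html_attr and html_text2 can be sketched on a made-up list item. The markup below is an assumption for illustration, with the title and the link text deliberately different. It also shows that the chained html_elements calls can be written as a single descendant selector.

```r
library(rvest)

# Made-up markup (an assumption, not the real page).
page <- minimal_html('
  <div class="mw-category-group">
    <ul><li><a href="/wiki/Example" title="Example Name">Example Name (alias)</a></li></ul>
  </div>
')

# "div li a" selects every <a> inside an <li> inside a matching div,
# which is what chaining html_elements("li") and html_elements("a") does.
links <- page %>% html_elements("div.mw-category-group li a")

html_attr(links, "title")  # value of the title attribute
html_text2(links)          # visible text of the link
html_attr(links, "class")  # NA, because the attribute is missing
```

Because html_attr returns NA for a missing attribute, it may be worth checking for NA values when a page mixes links with and without titles.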

Data Cleaning

We can start exploring the results of the scraping:

print(head(names))
[1] "Category:Aimians"    "Category:Alethi"     "Category:Azish"     
[4] "Category:Emuli"      "Category:Herdazians" "Category:Iriali"    

Oops, the code has captured an additional list that precedes the character list. In my testing, I have not found a way to distinguish between the two lists from the HTML structure. Thankfully, we can rely on regular expressions to complete the job. The unwanted list items all start with “Category:”, so a single call to the str_starts function from the stringr package lets us remove these elements.

names <- names[!str_starts(names, "Category:")]
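
As a small sketch of what this filter does, the example below runs it on a made-up vector standing in for the scraped titles (the values are assumptions for illustration). str_starts returns one logical value per element, and negating it keeps everything that is not a category link.

```r
library(stringr)

# Made-up titles standing in for the scraped vector.
titles <- c("Category:Group A", "Category:Group B",
            "First Person", "Second Person (note)")

is_category <- str_starts(titles, "Category:")  # TRUE where the prefix matches
is_category                                     # TRUE TRUE FALSE FALSE

kept <- titles[!is_category]                    # drop the category links
kept

# Base R's startsWith() gives the same answer for a fixed prefix like this one.
identical(is_category, startsWith(titles, "Category:"))
```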

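The list still requires some additional work, as parentheses are used throughout the list to give additional context. These parenthetical notes will not appear in the text of the books, so we need to remove them with a second regular expression. The “(” needs to be escaped with two “\” rather than just one: R first processes the string literal, where “\\” stands for a single backslash, and the regular expression engine then receives “\(”, which matches a literal parenthesis.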

names <- str_remove_all(names," \\(.*\\)")

print(head(names, 10))
 [1] "Abaray"        "Abiajan"       "Abrial"        "Abrobadar"    
 [5] "Abronai"       "Abry"          "Acis"          "Adin"         
 [9] "Adis"          "Adolin Kholin"
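
On the double backslash in the parenthesis pattern: the sketch below shows what the regular expression engine actually receives, and an equivalent raw string that avoids the doubling. Raw strings assume R 4.0 or later, and the sample entries are made up for illustration.

```r
library(stringr)

pattern <- " \\(.*\\)"

writeLines(pattern)  # prints the characters the regex engine receives
nchar("\\(")         # 2: one backslash and one parenthesis

# A raw string (R >= 4.0) holds the same pattern without doubled backslashes.
identical(pattern, r"[ \(.*\)]")

# Made-up entries for illustration.
cleaned <- str_remove_all(c("Some Name (note)", "Other Name"), pattern)
cleaned
```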

We can see that the scraped data is in much better condition. There is still additional work we can do, as some entries include both a first and a last name. The last names are not particularly important, so we can drop them altogether.

names <- str_remove_all(names," .*") 

print(head(names, 10))
 [1] "Abaray"    "Abiajan"   "Abrial"    "Abrobadar" "Abronai"   "Abry"     
 [7] "Acis"      "Adin"      "Adis"      "Adolin"   
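
A minimal sketch of the first-name step on made-up names (assumptions for illustration) is shown here. It also hints at one side effect that may be worth checking on the real list: entries that share a first name collapse into duplicates.

```r
library(stringr)

# Made-up names for illustration.
full <- c("Single", "First Last", "First Middle Last")

# Remove the first space and everything after it.
first_only <- str_remove_all(full, " .*")
first_only

# stringr::word(x, 1) extracts the first word and agrees here.
identical(first_only, word(full, 1))

# Two entries now read "First", so duplicates are possible.
anyDuplicated(first_only) > 0
```

If duplicates matter for the later analysis, wrapping the result in unique() would be one way to handle them.
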

A few remaining entries are not usable as character names, so we remove them explicitly.

names <- names[!names %in% c("Word", "She", "User:Thurin")]

The final list is now in a condition that we can use to explore the relationships between characters.

Conclusion

We have scraped the Stormlight Archive character wiki page with the rvest package. We loaded the page with the read_html function. We were then able to sort through the different HTML elements with the html_elements function to find where the character list is stored, and we obtained the actual names with the html_attr function. The data collected still contained some unwanted entries, so we removed an additional list, the text in parentheses and the last names of all characters. We can now move forward with scraping the books to identify the strength of the relationships between the characters.

Photo by Paz Arando on Unsplash