come for the viz, stay for the craft (or vice versa)

Data Blog

Web Scraping Part One

So you need a data set from a database on the web, and there doesn’t seem to be an easy way to download it? In this post I will run through the process I used to create a data set of heritage monuments and places from Monument Australia, and how you can visualise them.

The skills you will learn here should be transferable to other sites as well.

This post is in three parts:

  1. Getting set up for scraping and extracting the first list / data set you will need

  2. Extracting the full data set and cleaning it up

  3. Visualising!

Part 1 - Getting set up for scraping and extracting the first list / data set you will need

Just under twelve months ago I was asked by a historian at the university I am connected with to create a data set of historical monuments and places named after or dedicated to the voyage of James Cook and the Endeavour. I then needed to create visualisations of this data set.

The aim was to capture all the times Australians have found it necessary to memorialise or celebrate Cook and the voyage of the Endeavour, and to be able to see/visualise the peaks in monument dedication in time or location. The tools I built, though, ended up giving the researchers the ability to expand the search to any white/colonial historical figure of the past beyond Cook.

This post is not to share any of the insights we discovered, but to document my process. It was one of those projects where you organically create a workflow, and in time look back at your code and wonder what it does, and how the hell you got to where you did.

I’ll start off with the problems that you also may be facing right now if you stumbled across this post in a desperate search for help (this is exactly where I was at the start of this project as well!). The majority of the data we needed sits in databases created for state heritage bodies, and much of it is not easily accessed beyond pulling down one record at a time. Downloading all of the data relating to your query is rarely offered, but a shout out to the QLD and ACT governments, who from memory had a good and easy-to-navigate system allowing you to download your data. One state had an “easy search function” that was essentially a PDF that you downloaded and did a text search on. The site I really needed to access was impenetrable. While they had developed a ridiculously complex mapping tool, simply getting a data set to run your own analysis was (and still is) impossible. I wrote to one of the IT guys there, cognisant that many sites are not keen for you to slam their system with a scraper, as well as being hopeful he’d just send me the data in a CSV. He instructed me to build a scraper. Thanks for that…so much for “open data”.

As luck would have it, the day I sat down to make a start on it, this government site was down. So instead I decided to get data from Monument Australia, a wonderful self-funded site run by a couple (and if you end up using them for any research or data, make a donation to help keep the site running).

The example I will take you through uses R. I am assuming that you have some basic knowledge of R. I am also assuming that you have a basic working knowledge of HTML, and that you can use the inspector tool in your browser. My example also uses Chrome. Why R? I knew that I’d be working up a lot of the data visualisations using ggplot2, and I like to keep everything in the same workflow if I can. When I do something new, my first question normally is “Can I do this in R?” and most of the time the answer is yes.

I have a smart friend who set me in the right direction with the packages I’d likely need to use, and I did find some blog posts that helped me along the way. It’s been over 12 months now and for the life of me I cannot find them, so just know that none of this was “invented” by me, and I am standing on the shoulders of others ;-)

My smart friend (hello Dan) also gave me some good advice: to break everything I needed to do down into little steps.

  1. The first was to understand what I needed to produce at the end, and to think about the data that I needed and how it would be structured

  2. The second was to break down all the small steps I needed to do as I wrote the R script for my scraper. This evolved as I went along and understood a bit more about what I was doing, so with the benefit of hindsight I can say the following steps were what I needed to do;

    • navigate to the page I needed

    • enter the search terms I needed

    • perform the search

    • extract a list of the monuments I needed more information on

    • use the list to instruct the scraper to go to each of the entries and extract the information for each monument

    • combine and reformat all this information into one file

What data did I need?

My first step was to “mock up” what my data set needed to look like. I wanted to have a clear idea of what I needed before I went down any rabbit holes. The data needed to go to a researcher who cross-validated every site or monument, to ensure that we only had monuments for James Cook of the Endeavour and not James Cook the local mayor. Basically a row per record, and this needed to include the address, the geo-location, the year the monument was dedicated and any other tasty info we could gather. (I also later expanded the search beyond Cook to Banks, Solander and the Endeavour.) I also did some sketches of the types of visualisations I’d like to create, and thought about the data and data structures I’d need to make them.
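If it helps to see that mock-up as code, here is a minimal sketch of the kind of structure I was aiming for. The column names and the example row are entirely hypothetical, just there to make the target concrete:

library(tibble)

##hypothetical mock-up of the target structure: one row per monument
target_mockup <- tribble(
        ~Monument,               ~Address,              ~State, ~Latitude, ~Longitude, ~Year_dedicated,
        "Captain Cook Memorial", "Example St, Kurnell", "NSW",  -34.0,     151.2,      1970
)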


Monument Australia was a great place to start. I went to their advanced search page and entered Cook as the keyword, and a list of records was returned. If you do the same, you will see that many of them are not related to James Cook, but at this stage we just need to create the list, so record accuracy is not our goal right now.

If you also do the same search, you can see that this initial view does not give us all the information we need, but if we click into a record, it’s all there. So my goal at first was simply to create a list of records to use in a loop to grab the rest of the data.

Now that I had a rough idea of what I needed, I had to work out how to do it. Again, I had some tips from other blog posts plus a list of packages that could help me.

Setting up your environment in R

Install and load the following packages:

  1. httr

  2. RSelenium

  3. rvest

  4. xml2

  5. tidyverse

  6. lubridate

Note: when I started this up yesterday I was having some issues with curl, which httr depends on. I managed to get around this by selecting NO at the prompt when installing httr.
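If you haven’t installed these before, something like this in the console should do it:

install.packages(c("httr", "RSelenium", "rvest", "xml2", "tidyverse", "lubridate"))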

You also need to start Docker, so install that if you don’t have it, and run the following command in your terminal to get it going (do not run this in R):

docker run -d -p 4445:4444 selenium/standalone-chrome

then run:

docker ps

In R, run the following code, which will connect you to the Selenium server you just started via Docker:

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()

If all goes well, you should be able to see in the console “connecting to remote server” and a series of returns after this. All good!
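If you want an extra check beyond the console message, RSelenium’s remoteDriver also has a getStatus method (at least it did in the version I was using) that asks the Selenium server for some basic information:

##optional: ask the Selenium server for its status
remDr$getStatus()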

Navigating to the page you need to scrape

The next thing is to navigate to the Monument Australia website search page. We will also take a screenshot and send it to your Viewer (I am using RStudio).

remDr$navigate("http://monumentaustralia.org.au/search")
##give the page time to load
Sys.sleep(5)
##take a screenshot to check
remDr$screenshot(display = TRUE)

All you are doing here is using the object you created called remDr to navigate, i.e. “go to this webpage”. You are also giving the page a chance to load fully via Sys.sleep, and then taking a screenshot, which will appear in your Viewer. It’s a bit of “visual payoff” to see that it’s working, as well as a check that you are going to the right page.

Entering the search term and performing the search

Next, we want to enter our search term, and we will get R to do it for us instead of typing it in with our keyboard. But first we need to work out what we need to tell it to do.

Open up the webpage in your browser, and open your inspector. I need to get “Cook” into the keyword field somehow. I played around a bit at this stage, as for me the Town field appears to be set to VIC and not blank, but this didn’t impact the search results. If it did, I would have had to muck about a bit more with some keyboard commands, but luckily there was no need.
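For what it’s worth, if a pre-filled field like that had mattered, RSelenium can clear it for you before you type. This is just a sketch of what that might look like; I’m assuming the field’s name attribute is “town”, which you’d need to confirm in your inspector:

##hypothetical: clear a pre-filled field before typing into it
town_field <- remDr$findElement(using = 'name', value = "town")
town_field$clearElement()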

When I inspect the page in my browser, I can see that “Keyword” sits in a row of a table (denoted by the <td> tag). Next to it there is a text input field. We need to create a command to look for the text field that has been given the attribute name="keyword".


We do this by running the following code snippet:

searchname_field <- remDr$findElement(using = 'name', value = "keyword")
searchname_field$sendKeysToElement(list("\b", "Cook", "\uE007"))
Sys.sleep(5)
remDr$screenshot(display = TRUE)

This creates an object called “searchname_field” that again uses remDr, this time with findElement to find the right place by looking for the attribute that sits in the html tag: name with the value of “keyword”.

The next line tells R to navigate to this text box and enter “Cook”. We have told R to go to searchname_field based on our specifications, and then “type” our search term via sendKeysToElement. I am not sure why, but there was always a 1 before “Cook” when I did this, so “\b” just says to backspace, and "\uE007" is “enter”.

(This is a very handy page for other common commands you may need to use to get around your page using Selenium.)
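As an aside, if remembering "\uE007" bothers you, sendKeysToElement also accepts named keys; as far as I know the line below does the same thing as the unicode version:

##equivalent, using a named key instead of the unicode code for enter
searchname_field$sendKeysToElement(list("\b", "Cook", key = "enter"))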

Finally, we again give the page a chance to load (hello slow internet at home), and take a screenshot so you can see in your Viewer that it’s worked.

Your screenshot should show that the right search has taken place. If you have a look at your own browser by doing this manually, it’s pretty clear that some of the items returned are not about Cook or the Endeavour but contain the word Cook somewhere. That’s fine, as at this stage we are not concerned about accuracy, just about compiling our list.

It also helps at this point to have a quick manual go at the search yourself in your browser and make sure it doesn’t push the search results over multiple pages. Thankfully the search here didn’t do that - otherwise you’d have to do some more work here (see the sketch just below for the general idea). For now though, we can move on!
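If you do hit a paginated result set, the usual trick is to loop: grab the current page, click the “next” link, and repeat until there is no next link. Here is a rough sketch of that idea; the ".next" selector is completely made up, and you’d need to find the real one in your inspector:

##hypothetical pagination loop - the ".next" selector is for illustration only
all_pages <- list()
repeat {
        all_pages[[length(all_pages) + 1]] <- remDr$getPageSource()[[1]]
        next_link <- remDr$findElements(using = 'css selector', value = ".next")
        if (length(next_link) == 0) break
        next_link[[1]]$clickElement()
        Sys.sleep(5)
}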

This initial search does not contain all the information we need. It’s just the summary of the records, but it has enough information in here for us to create our first list.

Extracting the list of monuments

Next up I want to extract a few pieces of information about each monument: its name, and the URL of the full record. Run this code snippet and see the result in your console:

results_html <- read_html(remDr$getPageSource()[[1]])
results_html

You have just told R to get all the html on the page, and read_html saves it as an xml document. This means we can search this document for the information we need.

Go back now to your browser, and open the inspector. The text of the title of the record is stored in a div with a class of “producttitle”. We need to extract that.


Run this code snippet in R:

Search_results_title <- results_html %>% 
        html_nodes('.producttitle') %>% 
        html_text() %>% 
        enframe(name=NULL) %>% 
        rename(Monument= value)

You are telling it to look in “results_html” (the full xml document) and create an object called “Search_results_title”: it looks for the divs with the “producttitle” class and takes the text from them. enframe turns this into a tibble, and rename names the variable “Monument”.


If you look at Search_results_title, we have created a tibble with one column and 571 records.
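A quick way to check that in the console (rather than eyeballing the viewer) is:

##quick sanity check on the result
nrow(Search_results_title)
head(Search_results_title)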

Next we need to grab the URL. Return to your inspector in your browser, and you can see that it’s in the href attribute of the element with the “morebutton” class. It’s only a partial URL though, so after we extract it we need to do some pasting to make it complete.

Run this code snippet:

Search_results_url <- results_html %>% 
        html_nodes(".morebutton") %>% 
        html_attr("href") %>% 
        enframe(name=NULL)

You are creating a new tibble called Search_results_url by taking the xml document and extracting the href information from the element with the class of “morebutton”.

If you look at this output it shows us what we have collected.

We need to add more information to the url to make it complete.

Search_results_url <- results_html %>% 
        html_nodes(".morebutton") %>% 
        html_attr("href") %>% 
        enframe(name=NULL) %>% 
        rename(baseurl = value) %>% 
        mutate(urlpaste = "http://monumentaustralia.org.au/search/") %>% 
        mutate(url = (paste(urlpaste,baseurl, sep="" ))) %>% 
        select(-baseurl,-urlpaste)

This time, we are grabbing the same information, renaming the variable as “baseurl”, creating a new variable called “urlpaste” that contains the extra information we need to make the URL complete, then creating a new variable called url by pasting “urlpaste” and “baseurl” together. Finally, we are un-selecting the two dummy variables, leaving us with just the full url.
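It’s worth a quick glance at the result to make sure the pasted URLs look sensible:

##check the first few full urls
head(Search_results_url$url, 3)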

We only really need the list of urls, so I’m not going to bother combining the two files we just created. Think of “Search_results_title” as practice ;-)

Creating “Search_results_url” was our goal in that first stage, so now we can move onto the next step of getting all the information we need for each record.

Here is the complete code snippet for stage one, or if you use GitHub, here is a link.

library(httr)
library(RSelenium)
library(rvest)
library(xml2)
library(tidyverse)
library(lubridate)


##opening up server via docker
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                         port = 4445L,
                                         browserName = "chrome")
remDr$open()

##navigating to monument Aus site / search page

remDr$navigate("http://monumentaustralia.org.au/search")

# Give page time to load
Sys.sleep(5)

##take a screenshot to check
remDr$screenshot(display = TRUE)


##entering search term
searchname_field <- remDr$findElement(using = 'name', value = "keyword")
searchname_field$sendKeysToElement(list("\b", "Cook", "\uE007"))

Sys.sleep(5)
remDr$screenshot(display = TRUE)

##grabbing all the html and creating an xml document
results_html <- read_html(remDr$getPageSource()[[1]])



#grabbing url and completing it and creating first file we need 
Search_results_url <- results_html %>% 
        html_nodes(".morebutton") %>% 
        html_attr("href") %>% 
        enframe(name=NULL) %>% 
        rename(baseurl = value) %>% 
        mutate(urlpaste = "http://monumentaustralia.org.au/search/") %>% 
        mutate(url = (paste(urlpaste,baseurl, sep="" ))) %>% 
        select(-baseurl,-urlpaste)