
Web Scraping Part Two

Let’s loop!

My previous post here outlined why I was undertaking this exercise, the type of environment you need to set up, and how to extract the initial list of URLs.

This means we can now move on to creating the scraper that will collect data on all the records we need, and then reformat it into the final data set.

As before, it’s a good idea to think about the individual steps we want our scraper and R script to do for us (there’s a skeleton sketch of the plan just after this list).

  1. Look at the list of URLs and visit each page/record on that list

  2. Take the data that we specify from each record

  3. Combine all of these into one file with a row per record
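
In skeleton form, the plan looks like this (a sketch only - list_of_urls and scrape_one_record are hypothetical placeholders for the objects and extraction code we build through the rest of this post):

all_records <- lapply(list_of_urls, scrape_one_record)  # steps 1 and 2: visit each url, extract the data
final_data <- do.call(rbind, all_records)               # step 3: combine into one row per record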

We could do this individually for each page like we did in the last exercise (ie: send the scraper to navigate to page A, take the data, then write a command to navigate to page B, and so on), but we have over 500 rows, so that would be a lot of code. Instead, what we want to do is create a loop, where R will cycle through all of the code for each url on our list, and then stop when it has gone through all of the URLs. Now, writing loop functions used to scare the daylights out of me, but trust me when I say it’s actually pretty easy, as you will see.
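
If the word “loop” still sounds intimidating, here’s the idea in miniature on three letters rather than 500-odd URLs (a toy sketch, nothing to do with our data yet):

# instead of writing print(toupper("a")), print(toupper("b")), print(toupper("c"))
# by hand, we give R the whole vector and let it repeat the work for us
for (letter in c("a", "b", "c")) {
        print(toupper(letter))
}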

Before we do this though, and in the spirit of breaking things down, we should try to collect the data from one record first, then write the loop function. This will be familiar and easy, as it’s the same sort of step as the previous post: we navigate to the page, read all the HTML into an object, and then instruct R to extract different fields based on the tags in that HTML.

A Simple Example

Run this code snippet, which tells R to navigate to this page (the first link from the list we created) and read in the html. We are calling this object “webpage”:

library(rvest)      # read_html(), html_nodes(), html_text() - if not already loaded from part one
library(tidyverse)  # the pipe, enframe(), rename(), etc

webpage <- read_html("http://monumentaustralia.org.au/search/display/30861-%60last-man%60-and--%60last-shilling%60-monument")

Now we want to collect the text and information relating to this monument.

# the monument's name is in the h1 tag
Monument_Detail_Name <- html_nodes(webpage, "h1") %>% 
        html_text() %>% 
        enframe(name = NULL) %>% 
        rename(Monument = value)

# the header names are in th tags
monument_header <- html_nodes(webpage, "th") %>%
        html_text() %>% 
        enframe(name = NULL) %>% 
        rename(Header = value)

# the detail text is in td tags
monument_detail <- html_nodes(webpage, "td") %>% 
        html_text() %>% 
        enframe(name = NULL) %>% 
        rename(Detail = value)

Have a look at the page in your browser and open your inspector: h1 tags store the name of the monument, th tags store the header names (useful, as they tell us what the data is about), and td tags store the text/detail. See the output below.

[Screenshot of the output]

We now have all the information for the monument; next we need to format and combine the tibbles we just created.
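
One optional check first (a quick sketch, reusing the webpage object from above): bind_cols will only work if the two tibbles have the same number of rows, so it’s worth confirming the page has a th for every td.

# these two numbers should match, otherwise the headers and details won't line up
length(html_nodes(webpage, "th"))
length(html_nodes(webpage, "td"))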

Monument_Detail_Info <- bind_cols(monument_header, monument_detail)

Monument_Detail_Info <- Monument_Detail_Info %>%
        mutate(Monument = Monument_Detail_Name$Monument) %>%
        select(Monument, Header, Detail) %>% 
        mutate(url = "http://monumentaustralia.org.au/search/display/30861-%60last-man%60-and--%60last-shilling%60-monument")

The first step is to combine the columns together (bind_cols) and save the result in Monument_Detail_Info. Next we create a column that contains the name of the monument, re-order the columns (not necessary, just me!), and finally add a column with the url of the record. The output will look like this:

[Screenshot of the output]

That contains all the info, but it’s not in the “one row per record” format that we need. So we simply reformat it by running this command:

Monument <- Monument_Detail_Info %>% 
        spread(Header, Detail) 
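
A side note: in tidyr 1.0.0 and later, spread() has been superseded by pivot_wider(). Both still work; if you prefer the newer function, the equivalent call would be:

Monument <- Monument_Detail_Info %>% 
        pivot_wider(names_from = Header, values_from = Detail)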

and the output will collapse everything into a single row for the monument, with one column per header.

This is exactly the format we need, and all of the information we have to collect. So now all we need to do is put that together with the list we created in the previous blog post.

(If you’d like the code for this simple / practice version you can find it here on GitHub)

Creating the Loop

We now have a clear idea of what we need the loop function to do when it hits each page, so we can create the loop function itself. The tibble we created with the url list is called “Search_results_url”, and it stores each record’s url in a column called url.

lapply is a handy tool in base R that applies a function to each element of its input and returns a list of the same length. So, if you look at the first line of the snippet below, we are telling R to look at each row of Search_results_url, in the column called url, and do the following (everything in the curly brackets) until it reaches the end of the list. I hope I’m explaining that simply, because it is pretty simple - much simpler than it feels to hear “create a loop function…” for the first time. Like most things to do with analytics, the language (or discourse) surrounding the practice is much more complicated than the actual doing… but maybe you are here from the humanities and maybe you’ve read Foucault, and you are nodding and thinking “yes! This codified language always sounds more complex than what it actually is”.
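
To see that “same-length list” behaviour on something trivial before we unleash it on 571 URLs, try this toy sketch:

# one result per input, collected into a list - exactly what our
# scraper will do, just with tibbles instead of numbers
squares <- lapply(c(1, 2, 3), function(i) i^2)
squares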

So every time you see the little “i” in the snippet, we are just telling the loop to look at the row it is up to. If you run this code snippet it can take a while, as it’s repeating the task multiple times, but I am always thankful when I think that my eight-year-old MacBook Air can run these little functions efficiently and smoothly, repeating a dull and boring task for me hundreds of times in less than a few minutes - can you imagine doing all this manually? (Give your machine a little pat and thank it for its service.)

Cook_detail <- lapply(Search_results_url$url, function(i){
        
        # navigate to the record and read in the html
        webpage <- read_html(i)
        
        # the monument's name lives in the h1 tag
        Monument_Detail_Name <- html_nodes(webpage, "h1") %>% 
                html_text() %>% 
                enframe(name = NULL) %>% 
                rename(Monument = value)
        
        # th tags hold the header names, td tags hold the detail text
        monument_header <- html_nodes(webpage, "th") %>%
                html_text() %>% 
                enframe(name = NULL) %>% 
                rename(Header = value)
        
        monument_detail <- html_nodes(webpage, "td") %>% 
                html_text() %>% 
                enframe(name = NULL) %>% 
                rename(Detail = value)
        
        # combine, then add the monument name and the url we are up to
        Monument_Detail_Info <- bind_cols(monument_header, monument_detail)

        Monument_Detail_Info <- Monument_Detail_Info %>%
                mutate(Monument = Monument_Detail_Name$Monument) %>%
                select(Monument, Header, Detail) %>% 
                mutate(url = i)

        # return the finished tibble so lapply can collect it
        Monument_Detail_Info
        
})

Once it’s run, your output will be 571 tibbles stored in a list (that is what lapply always returns - basically it’s like a suitcase that can hold all different types of data together), with the last one looking like this:

[Screenshot of the output]
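
If you want to poke around the suitcase yourself, a few quick checks (assuming the loop above has finished running) confirm what we’re holding:

class(Cook_detail)       # "list" - what lapply always returns
length(Cook_detail)      # 571 - one tibble per record
Cook_detail[[571]]       # print the last tibble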

To row-combine them all, use “do.call” with rbind, then reformat in the same way we did for the simple example.

Cook_finaldf <- do.call(rbind, Cook_detail)   
# View(Cook_finaldf)

Cook <- Cook_finaldf %>% 
        spread(Header, Detail) %>% 
        left_join(Cook_Results)   # join back on the search results tibble from the previous post
        
write_csv(Cook, "cookDraft.csv")
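
A small aside: dplyr’s bind_rows() does the same row-combine as do.call(rbind, ...) in a single step, and is more forgiving if any page happened to produce an extra column, so it would be a reasonable swap here:

Cook_finaldf <- bind_rows(Cook_detail)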

Now you’ve done the largest part of the work, and you can see the csv file you’ve created, saved in the working directory you are using for R. The next stage is quite finicky, and was a very iterative process. There is a lot of reformatting I did to get the file clean, but I tried to make as much of it reproducible as possible so I could re-run it if I ever needed to re-extract the data. I won’t go through it in detail - most of it is just reformatting dates, tidying variable names, and fixing up the geo-location information - and you will get a few warnings if you run this code snippet, but it’s all OK. If you do have any questions about it (or suggestions on making it a little less verbose) please drop me a line!

library(lubridate)  # for dmy() and today(), if not already loaded

Cook_cleaned <- Cook %>% 
        # tidy the monument name and parse the various date columns
        mutate(Monument = gsub("Print Page ", "", Monument)) %>% 
        mutate(`Actual Event End Date:` = dmy(`Actual Event End Date:`)) %>% 
        mutate(`Actual Event STart Date:` = dmy(`Actual Event STart Date:`)) %>% 
        mutate(`Actual Monument Dedication Date:` = dmy(`Actual Monument Dedication Date:`)) %>% 
        mutate(`Approx. Monument Dedication Date:` = dmy(`Approx. Monument Dedication Date:`)) %>%
        # put a space before "Long" and "Note" so the coordinates can be pulled apart
        mutate(`GPS Coordinates:` = gsub("Long", " Long", `GPS Coordinates:`)) %>%
        mutate(`GPS Coordinates:` = gsub("Note", " Note", `GPS Coordinates:`)) %>%
        separate(`GPS Coordinates:`, "Coordinates1", sep = "Note", remove = FALSE) %>%
        mutate(Coordinates1 = gsub(" ", "", Coordinates1)) %>% 
        mutate(Coordinates1 = gsub(":", "", Coordinates1)) %>% 
        # split into Lat and Long at the boundary between letters and digits
        separate(Coordinates1, into = c("Lat", "Long"), "(?<=[a-z])(?=[0-9])") %>% 
        mutate(Lat = gsub("Lat", "", Lat)) %>% 
        mutate(Lat = gsub("Long", "", Lat)) %>% 
        mutate(Source = "Monument Australia") %>% 
        mutate(Date_Sourced = today(tzone = "")) %>% 
        # tidy the variable names (the odd capitalisation, eg "STart", matches the scraped headers)
        rename(Actual_Event_End_Date = 'Actual Event End Date:',
               Actual_Event_Start_Date = 'Actual Event STart Date:',
               Monument_Dedication_Date = 'Actual Monument Dedication Date:',
               Address = 'Address:',
               Approx_Event_End_Date = 'Approx. Event End Date:',
               Approx_Event_Start_Date = 'Approx. Event Start Date:',
               Area = 'Area:',
               Monument_Designer = "Monument Designer:",
               Monument_Manufacturer = "Monument Manufacturer:",
               Monument_Theme = "Monument Theme:",
               Monument_Type = "Monument Type:",
               Link = 'Link:',
               State = "State:",
               Sub_Theme = 'Sub-Theme:') %>% 
        # drop the columns we no longer need and set the final column order
        select(-'Approx. Monument Dedication Date:', -'GPS Coordinates:', -Actual_Event_End_Date, -Approx_Event_Start_Date) %>%
        mutate(Figure = "Cook") %>% 
        select(Figure, Monument, url, Monument_Dedication_Date, Address, State, Area, Lat,
               Long, Monument_Designer, Monument_Manufacturer, Monument_Type, Monument_Theme, Sub_Theme,
               Actual_Event_Start_Date, Approx_Event_End_Date, Date_Sourced, Source)


write_csv(Cook_cleaned, "Cook_data.csv")
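
Before sharing the file, a quick glimpse() (from the tidyverse) is a handy final check that every column has come out with the type you expect:

glimpse(Cook_cleaned)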

If you compare the first draft csv file you exported to the more recent one above, you can see the changes that have happened. I look at the snippet above as well and think “WTF”, because so much of it is little things that you find and may only ever use once, so they are pretty easy to forget.

I was able to upload this file and share it with the researcher, who then went through and checked the information to see if it was relevant, as well as filled in any important blanks we had regarding dates. The next blog post I share will use a cleaned and cut-down version of the data, so we can create an arc plot similar to the one below:

[Image: arc plot preview]

The R code I used for the data is available here on GitHub.