Data Blog

Creating an Arc Plot Part 1

In previous posts here and here I outlined how I built a scraper in R to collect data. Next I had to experiment with some different ways to visualise the data.

One of my initial sketches showed a timeline from the initial event year (in this case the journey of the Endeavour up the east coast of Australia in 1770) to the year when each monument was dedicated. That sketch is what is called an arc plot, which is part of the “network” family.


A lot of the visualisation “geoms” that you regularly work with in charts show aggregations or proportions, but network graphs show connections between data points. They are good for showing process flows, associations, connections, or where things don’t connect in the ways you expect. The foundation of a graph is the node and the edge. Nodes are the things you are trying to show the connections between. The edges are the connections. In this case our nodes are years: 1770 and each year in which a monument was dedicated. So all our edges will spring from 1770 and branch to the other years. Most of the networks you work with will have more source nodes than this example.

The data structure you need is a little different from a regular chart. We need a file in which each row has a start node and an end node. This tells whatever tool you are using to draw your graph that an edge starts at A and ends at B. In our case each row says: start on the node called 1770, draw an “edge” (or line), and end on the node named for the year the monument was dedicated.
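To make that concrete, here is a minimal sketch of what an edge file looks like. The object name toy_edges and the three dedication years are made up for illustration; they are not taken from the real data.

```r
library(tibble)

# Hypothetical edge list: each row is one edge, from a start node to an end node.
# The years in "to" are invented for illustration only.
toy_edges <- tibble(
  from = c(1770, 1770, 1770),
  to   = c(1870, 1970, 1994)
)

toy_edges
```

Every row reads as "start at 1770, end at this year", which is all a graphing tool needs in order to draw one arc.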

My process is:

  1. Get my packages ready in R

  2. Import data

  3. Reformat data into an edge file

  4. Prepare the data so it can be read by the graphing tool

  5. Draw the graph!

Get my packages ready in R

The packages you need to have installed and in your library are:

  • tidyverse

  • lubridate

  • ggraph

  • tidygraph

  • scales

Some of these may not be needed until the next stage, but may as well get them ready now.
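A minimal sketch for loading them, assuming they are already installed. If one is missing, install.packages() with the package name in quotes should fetch it from CRAN.

```r
# Load the packages for this session.
# If any of these fail, install the package first, e.g. install.packages("ggraph")
library(tidyverse)
library(lubridate)
library(ggraph)
library(tidygraph)
library(scales)
```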

Import the data

I have saved a smaller data set on GitHub that is based on the scraping file we created in the previous post, but please:

IMPORTANT:

This is a draft data set. It is not verified or validated. It is only to be used for this example. Don’t use it for any research you may be doing. Not all the data points may be correct, and MANY are missing.

You have been warned :-)

First up, run the following snippet in your R console:

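# read the example data straight from GitHub
cookExample <- readr::read_csv("https://raw.githubusercontent.com/KellyTall/Hellomister_DataBlog/master/cookExample.csv")

# optional: save a copy as a csv in your working directory
readr::write_csv(cookExample, "cookExample.csv")

# open the data in the viewer
View(cookExample)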

This will read the data directly from my GitHub and create a new object called cookExample. If you want to save the file to your own system as a csv, run the second line of code and it will save it in your working directory. The final line lets you view the data and see the variables you have to play with.

Reformat Data into an Edge file

For this example the most important columns we will need are the “start” node and the “end” node for each of the arcs we want to draw. So we have to start with a column containing “1770” in every row, and need to make sure we have a column containing the monument dedication year. Each of these rows will create an “edge”, which is a line that connects our nodes (in this case the dates) together.

Run the following snippet to prepare this edge file.

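# prepare edge file
cook_edge <- cookExample %>% 
        mutate(from = 1770) %>%                                   # every edge starts at 1770
        select(from, Dedication_Year, Figure, Monument_Type) %>%  # keep only the columns we need
        rename(to=Dedication_Year) %>%                            # the dedication year is where the edge ends
        mutate(to=as.numeric(to)) %>%                             # make sure the year is a number
        group_by(Figure, Monument_Type, from, to) %>%             # group identical edges together
        summarise(weight = n()) %>%                               # count the rows in each group
        na.omit()                                                 # drop rows with missing values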

The above snippet does the following:

  1. Takes our original data and creates a new tibble called “cook_edge”

  2. Creates a new column called “from” and fills every row with 1770 - called “from” as this is where the edge “starts from”

  3. Keeps the variables from, Dedication_Year, Figure and Monument_Type

  4. Renames Dedication_Year to “to” - called “to” as this is where the edge or line we are drawing is “going to”

  5. This mutate makes sure that “to” is a number and not misread as a character. From memory I did this as there were some issues with the type of variable it created - any time you deal with dates there is always some nonsense, especially if the dates range from the 19th to the 21st century. I think this was to get everything in the right format.

  6. Next I use the dplyr verb “group_by” to group the rows by these variables, ready to be collapsed into one row per edge

  7. Next I use the dplyr verb “summarise” - n() counts the rows in each group you specified. It finds all the edges that are identical and counts them, and that count becomes the weight.

  8. Finally, na.omit() drops any rows that contain missing values

We will use the new variable we have created called “weight” as part of our encoding. Collapsing “like” edges and creating a weight is a good habit to get into. Imagine we have 4 monuments for Cook that are all gardens, and all dedicated in 1970. Instead of having four rows of data, one for each garden, and then drawing this line four times, we hold this information in a new column called “weight”. This means we can still use this information as a visual encoding, but we don’t force the edge to be drawn four times. This can save a lot of rows in your data set, as well as time to draw your edges (some networks are massive and take a lot of processing time), so always try to collapse them.
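To see the collapsing in action, here is a minimal sketch on made-up data. The tibble toy is a hypothetical stand-in for cookExample: four Cook gardens from 1970, one Banks flagpole from 1994, and one row with no year. I have used count(), which is dplyr’s shortcut for the same idea as group_by() followed by summarise(n()).

```r
library(dplyr)

# Hypothetical stand-in for cookExample (invented rows, not the real data)
toy <- tibble(
  Figure          = c("Cook", "Cook", "Cook", "Cook", "Banks", "Cook"),
  Monument_Type   = c("Garden", "Garden", "Garden", "Garden", "Flagpole", "Plaque"),
  Dedication_Year = c("1970", "1970", "1970", "1970", "1994", NA)
)

toy_edge <- toy %>%
  mutate(from = 1770, to = as.numeric(Dedication_Year)) %>%    # start node and numeric end node
  count(Figure, Monument_Type, from, to, name = "weight") %>%  # one row per distinct edge
  na.omit()                                                    # drop the row with no year

toy_edge
```

The four garden rows should come back as a single row with a weight of 4, and the row with the missing year should be gone.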

Have a look at the new tibble cook_edge and make sure you understand what you just did.

The next step is to make our graph object. Run the following snippet:

cook_tidy <- tbl_graph(edges=cook_edge, directed = TRUE)

This tells R to take our edge file and, using “tbl_graph” (part of tidygraph), make an object that we can draw as a network. We tell it which tibble to use, and we specify that this is a directed graph. Directed just means that our edges head in a certain direction: from and to. As we are dealing with progressing time, an edge cannot go from 1970 back to 1770. It can only head one way.

Some networks are not directed. If you had a network of peers and wanted to show a connection between Janet and Jane, and you know it is a two-way connection (they are friends), then this is an undirected graph: the edge simply joins the two nodes (people) and has no direction.

Sometimes you have to really stop and think about the direction prompt. It’s not always clear cut. For example, in social media I may follow Bob, but he does not follow me. That is a directed edge going from me to Bob, but not back the other way. I may also follow Frank, and he follows me back. That could be an undirected graph if all things are equal (i.e. everyone follows everyone), or a directed graph if not everyone follows everyone. If it’s directed, there would be one line for my relationship to Bob, one line from me to Frank, and another from Frank back to me. Always think carefully about this and how the connections work.
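Here is a small sketch of the two cases, using the people from the examples above. The object names are made up, and I have passed a node table as well as an edge table so that the numbers in from and to refer to rows of the node table.

```r
library(tidygraph)

# Directed: I follow Bob, I follow Frank, and Frank follows me back
people  <- tibble::tibble(name = c("me", "Bob", "Frank"))
follows <- tibble::tibble(from = c(1L, 1L, 3L), to = c(2L, 3L, 1L))
follow_graph <- tbl_graph(nodes = people, edges = follows, directed = TRUE)

# Undirected: Janet and Jane are friends, so one edge covers both directions
friends    <- tibble::tibble(name = c("Janet", "Jane"))
friendship <- tibble::tibble(from = 1L, to = 2L)
friend_graph <- tbl_graph(nodes = friends, edges = friendship, directed = FALSE)

follow_graph
friend_graph
```

In the directed version my mutual connection with Frank takes two rows, one in each direction. In the undirected version a single row is enough.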

Have a look at this object:

[Screenshot: printed output of the cook_tidy object]

It tells us that it’s created a table graph, with other information about the nodes and edges. The first row tells us that it has created one edge from 1770 to 1994, and it’s a flagpole (!!) dedicated to Joseph Banks, and there is one of them. If there were two flagpoles dedicated to Banks in the same year, then it would have a weight of two.
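If you would rather look at the nodes and edges as ordinary tibbles than read the printed summary, a sketch like the following should work. It uses a small made-up graph (toy_graph) so it runs on its own; the same two calls should apply to cook_tidy.

```r
library(tidygraph)

# Hypothetical three-node graph standing in for cook_tidy
years    <- tibble::tibble(name = c("1770", "1970", "1994"))
toy_edge <- tibble::tibble(
  from   = c(1L, 1L),
  to     = c(2L, 3L),
  Figure = c("Cook", "Banks"),
  weight = c(4L, 1L)
)
toy_graph <- tbl_graph(nodes = years, edges = toy_edge, directed = TRUE)

# activate() picks which table to work on; as_tibble() pulls it out of the graph
node_table <- toy_graph %>% activate(nodes) %>% tibble::as_tibble()
edge_table <- toy_graph %>% activate(edges) %>% tibble::as_tibble()

node_table
edge_table
```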

Now run this code snippet. The output displayed after running cook1 is shown below it:

cook1 <- ggraph(cook_tidy, layout = 'linear') + 
        geom_edge_arc()
cook1

[Image: Rplot.png, the arc plot drawn by cook1]

Straight out of the box we have a pretty good start, and it doesn’t look too dissimilar to my initial sketch. All of the edges start at one point, and then are drawn to the date that the monument was dedicated. Let’s add some information that also tells us “how many” were dedicated in each year. We can use the weight variable to encode the line width to pass on that information. Use the following snippet:

cook2 <- ggraph(cook_tidy, layout = 'linear') + 
        geom_edge_arc(aes(width=weight))

cook2

[Image: Rplot02.png, the arc plot drawn by cook2]

Now the width of our arc / edge is weighted to the number of monuments in that year. This is done by adding the aesthetic of width=weight. It’s a bit heavy though so we can play with the opacity of the lines.

cook3 <- ggraph(cook_tidy, layout = 'linear') + 
        geom_edge_arc(aes(width=weight), alpha=.5) 

cook3

[Image: Rplot03.png, the arc plot drawn by cook3]
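If the lines still feel heavy, another option is to limit how thick the arcs can get. This is a sketch only and is not part of the chart built in this post: it uses a made-up three-node graph, and the range values are arbitrary numbers to experiment with.

```r
library(tidygraph)
library(ggraph)

# Hypothetical three-node graph standing in for cook_tidy
years    <- tibble::tibble(name = c("1770", "1970", "1994"))
toy_edge <- tibble::tibble(from = c(1L, 1L), to = c(2L, 3L), weight = c(4L, 1L))
toy_graph <- tbl_graph(nodes = years, edges = toy_edge, directed = TRUE)

toy_plot <- ggraph(toy_graph, layout = 'linear') + 
        geom_edge_arc(aes(width=weight), alpha=.5) +
        scale_edge_width(range = c(0.3, 3))   # thinnest and thickest line widths

toy_plot
```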


We also have the historical figure the monument is dedicated to. We can try and use that as an aesthetic.

cook4 <- ggraph(cook_tidy, layout = 'linear') + 
        geom_edge_arc(aes(width=weight, colour=Figure), alpha=.5)

cook4

[Image: Rplot04.png, the arc plot drawn by cook4]

This shows us that, according to our data, the first monument was actually to Banks, and there were sporadic monuments dedicated to Cook and others, but in 1970 (the 200th anniversary) many were dedicated, and they continue to be dedicated to Cook. Quick reminder that this is not a complete data set, and some of the records may be included in error, but it gives you an idea of how an arc plot can show the progression from one moment to future moments.

The next snippet uses the inbuilt black and white theme in ggplot2, theme_bw() (just to take out some of the clutter for now).

cook5 <- ggraph(cook_tidy, layout = 'linear') + 
        geom_edge_arc(aes(width=weight, colour=Figure), alpha=.5) +
        theme_bw()
cook5

[Image: Rplot05.png, the arc plot drawn by cook5]


In our data set we also have a categorical variable that includes information about the type of monument. Let’s have a look at what the graph looks like if we use that in the aesthetics instead of Figure.

cook6 <- ggraph(cook_tidy, layout = 'linear') + 
        geom_edge_arc(aes(width=weight, colour=Monument_Type), alpha=.5) +
        theme_bw()
cook6

Way too many colours here. If your specific research question is around the changing fashion of monuments over time then you’d need to use a different type of visualisation here, or collapse the categories together (i.e. tree and trees together, possibly sculpture and monument together, etc.). The arcs overlap on the same year as well, so by using colour you are making it hard for your audience to pull apart all the information you need to be showing.
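If you did want to try collapsing the categories, one possible approach is sketched below with case_when() from dplyr. The category labels here are assumptions and may not match the spelling or capitalisation in the real data, and Monument_Group is a made-up column name.

```r
library(dplyr)

# Hypothetical monument types (check the real labels in your own data first)
toy_types <- tibble(
  Monument_Type = c("Tree", "Trees", "Sculpture", "Monument", "Garden")
)

toy_collapsed <- toy_types %>%
  mutate(
    Monument_Group = case_when(
      Monument_Type %in% c("Tree", "Trees")         ~ "Tree",
      Monument_Type %in% c("Sculpture", "Monument") ~ "Monument",
      TRUE                                          ~ Monument_Type  # leave everything else as it is
    )
  )

count(toy_collapsed, Monument_Group)
```

You would do this before building the edge file, and then group by the new column instead of Monument_Type so the collapsed categories share a weight.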

So, I was pretty happy with cook5 as the chart that I progressed further. The next post will be how to create the “upside down” dot plot that my sketch showed trailing off at the bottom of the arcs to give the viewer more information about the number of monuments dedicated in each year.

(If you’d like to know more about creating graphs through tidygraph head here, and for ggraph, go here. These two packages make a lot more complex graphs possible. Explore away! They were both created by Thomas Lin Pedersen, and he is on Twitter here.)

The full R script for what we have run above is here.