Making thematic maps in R

Geog 315 T2 2023

Overview

We cover a lot of ground in this lab, including handling simple spatial data in R, and the options available for mapping using the tmap package. Keep in mind that you can refer back to last week’s material if you need to refresh your memory of anything we did previously.

The instructions this week are in several parts, to make it a bit easier to follow.

You may want to take a quick look at the last section detailing the assignment so you know what is expected of you, then come back to the top of the page and work through the material in order, learning about the various tools you can use to complete it.

Some of this week’s material may be hard to follow at first, and some of it won’t come together completely until I explain the pipe (%>%) operator in class (you won’t have to wait long for this), but bear with me and all will become clear (the relevant lectures are already available).

As explained last week, you should set up an RStudio project for this week’s work to make sure you can keep track of everything.

Videos

Video support for this week’s lab is here

Introducing the data

Libraries

First you need to load the sf library for handling spatial data

library(sf)

tmap for mapping

library(tmap)

and dplyr for data wrangling

library(dplyr)

If any of these are not available on your installation, then install them with the Tools - Install Packages… tool from the main menu.

Data

The data for this week are election results from the US Presidential Election of 2016 (the one that gave us President Donald J. Trump). I’m sticking with that election because the data are more interesting than those from the more recent 2020 election.

You will find the data in this geopackage file. Download it and save it to an accessible folder on the computer you are working on, then create a new RStudio project in that folder, as you learned last week.

Then you can read the data using the st_read function

results <- st_read('election.gpkg')
## Reading layer `election' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/_labs/week-03/election.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 3092 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -2775548 ymin: -170310 xmax: 2256204 ymax: 3165691
## Projected CRS: NAD27 / Conus Albers

The data are at the county level for the US (over 3000 counties of wildly varying populations), with numbers of votes recorded for the two major parties—Democrat (dem) and Republican (gop)—along with some other minor parties: Libertarian (lib), Green (grn), Evan McMullin (una) (an independent conservative candidate who only really stood in Utah), and ‘Other’ (oth).

Inspect the data

To make sure all is as it should be, take a look at the first few rows of the data using the head function

head(results)
## Simple feature collection with 6 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 761051.8 ymin: 828012.4 xmax: 1026128 ymax: 1287748
## Projected CRS: NAD27 / Conus Albers
##   state    name population votes   dem   gop grn  lib una oth
## 1    AL Autauga      54571 24661  5908 18110 105  538   0   0
## 2    AL Baldwin     182265 94090 18409 72780 453 2448   0   0
## 3    AL Barbour      27457 10390  4848  5431  18   93   0   0
## 4    AL    Bibb      22915  8748  1874  6733  17  124   0   0
## 5    AL  Blount      57322 25384  2150 22808  89  337   0   0
## 6    AL Bullock      10914  4701  3530  1139  10   22   0   0
##                             geom
## 1 MULTIPOLYGON (((892112.8 11...
## 2 MULTIPOLYGON (((780232.6 94...
## 3 MULTIPOLYGON (((1026128 105...
## 4 MULTIPOLYGON (((845034.9 11...
## 5 MULTIPOLYGON (((871035.6 12...
## 6 MULTIPOLYGON (((988752.8 10...

[By the way, there is also a tail function.]

You can also see the dataset by clicking on it in the Environment tab of RStudio. If you want to make sure the spatial aspects are working OK, then try

results %>%
  select(population) %>%
  plot()

OK… the goal eventually will be to make a map of the election results. Here, I’ll explain the variables in the dataset.

The spatial units are US counties. Each county is in a particular state and has a name; this information is contained in the state and name attributes. The states each have a two-letter abbreviation, as shown on this map, so, for example, California is ‘CA’. Not all county names are unique, so we need the state abbreviations to identify counties unambiguously. For example

results %>%
  filter(name == 'Washington')
## Simple feature collection with 32 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -2118948 ymin: 774485.4 xmax: 2256204 ymax: 2845010
## Projected CRS: NAD27 / Conus Albers
## First 10 features:
##    state       name population  votes    dem   gop  grn  lib una oth
## 1     AL Washington      17581   8492   2366  6031   22   73   0   0
## 2     AR Washington     203065  78153  32296 40418 1275 3280   0 884
## 3     CO Washington       4814   2708    289  2284   11   81  14  29
## 4     DC Washington     601723 280272 260223 11553 3995 4501   0   0
## 5     FL Washington      24896  11144   2261  8630   21  191   0  41
## 6     GA Washington      21187   8422   4187  4138    0   97   0   0
## 7     ID Washington      10198   4454    776  3283   26  138 175  56
## 8     IL Washington      14716   7376   1446  5566   75  289   0   0
## 9     IN Washington      28262  11333   2636  8206    0  491   0   0
## 10    IA Washington      21704  10824   3938  6170   79  458  92  87
##                              geom
## 1  MULTIPOLYGON (((745944.3 98...
## 2  MULTIPOLYGON (((182541.1 14...
## 3  MULTIPOLYGON (((-569757.9 1...
## 4  MULTIPOLYGON (((1622264 191...
## 5  MULTIPOLYGON (((1009201 888...
## 6  MULTIPOLYGON (((1222984 121...
## 7  MULTIPOLYGON (((-1599544 25...
## 8  MULTIPOLYGON (((591785.5 17...
## 9  MULTIPOLYGON (((869100.4 17...
## 10 MULTIPOLYGON (((374443.1 20...

will return 32 (yes, 32!) counties called Washington (although only the first 10 are shown). There are also several Jefferson counties and a few other duplicates besides. (Fun Simpsons fact: there are as many as 36 places called Springfield in the US.)

More relevant to the assignment is the vote and population information contained in the other variables:

  • population an estimate of the county population. Note that these vary widely; use summary(results$population) to see just how much
  • votes the total number of votes cast in the presidential election in the county
  • dem votes cast for the Democratic Party candidate (i.e. Hillary Clinton)
  • gop votes cast for the Republican Party (GOP) candidate (i.e. Donald Trump)
  • grn, lib, una, oth votes cast for, respectively, the Green, Libertarian, Independent, and ‘Other’ candidates; apart from Utah, these numbers are low across the country

In the assignment it is important to consider whether these vote counts are what you want to map (and why), or whether it makes more sense to create additional variables for the vote share of the different candidates, or even the margin of victory, i.e., the difference between various candidates’ votes. To quickly map any one of these attributes, or any new attribute you make, use the following (you can easily change the variable or the colours), as we did last week

tm_shape(results) +
  tm_polygons(col = 'dem', palette = 'Blues')
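If you do decide to work with vote shares or a margin of victory, one way to derive them is with dplyr’s mutate function. The sketch below assumes results has been read in with st_read as above; the new column names (dem_share, gop_share, margin) are just suggestions, not part of the dataset:

```r
# add vote-share and margin attributes to the results dataset
# (assumes results has been read with st_read as shown earlier)
results <- results %>%
  mutate(dem_share = dem / votes,          # Democratic share of all votes cast
         gop_share = gop / votes,          # Republican share of all votes cast
         margin = gop_share - dem_share)   # positive where the GOP candidate won
```

A new attribute made this way can then be mapped exactly like any built-in one, for example with tm_polygons(col = 'margin').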

Data preparation

The spatial data we just loaded has some quirks, so in this document, I explain how you can tidy it up to make it more suitable for the assignment. To do this, we will be using the dplyr library which provides functions for selecting, filtering and combining data in various ways. This week we’ll just be seeing these in action, but in later weeks we’ll be looking more closely at their operation.

The tasks covered in this document are

Filtering data

The data in the election.gpkg file is a bit odd in that it has Alaska and Hawai’i where Mexico is in real life. To show how easy it is to select data in R, we will use the filter function from the dplyr package to get rid of them. (You might not want to do this in your map; this is just to demonstrate filtering.)

The variable we need for the selection is state. Say we wanted to map just California: we can do this by filtering the results dataset and then plotting it:

results %>%
  filter(state == 'CA') %>%
  plot()

How does this work? You can read the command above as something like “take the results dataset, then filter it, then plot it”. The pipe %>% directs whatever is before it into whatever function comes after it. The filter function specifies a condition for which cases to retain in the filtered dataset. Here we want all those cases with the state value ‘CA’ (i.e., California). If we wanted all the counties NOT in California, we can use ‘not equals’, which is designated by !=

results %>%
  filter(state != 'CA') %>%
  plot()
## Warning: plotting the first 9 out of 10 attributes; use max.plot = 10 to plot
## all

So, to get rid of Alaska and Hawai’i we can do this:

lower48 <- results %>%
  filter(state != 'AK' & state != 'HI')

Here we’ve combined the requirement that the state not be Alaska (AK) and not be Hawai’i (HI) using the & operator. This works fine and there is nothing wrong with this way of doing things.

This time, because we might want to retain the result we assign the output to a new variable called lower48 using <-.

We could also use the pipe operation to filter the data twice like this:

lower48 <- results %>%
  filter(state != 'AK') %>%
  filter(state != 'HI')

Which will give us the same result:

plot(lower48)
## Warning: plotting the first 9 out of 10 attributes; use max.plot = 10 to plot
## all

Because filtering data is not our primary focus right now, we will worry more about exactly how this works in later lectures and assignments.

For now, it is enough to know that we can do this kind of filtering to make new datasets, that the filter function is how we do it, and that the pipe operator %>% is a neat way to do it.

Again, the way to read the command above is “start with results as the input, pipe it into the first filter (which throws away Alaska, AK), then pipe it into a second filter (which throws away Hawai’i, HI)”. Pipes are central to the R tidyverse.

Either way, we now have a dataset with only the contiguous (‘lower 48’) states of the US.
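As an aside, the same selection can be written more compactly with the %in% operator, which tests whether a value appears in a vector. A sketch, equivalent to the filters above:

```r
# equivalent to the two filters above: drop rows whose state
# abbreviation appears in the excluded vector
lower48 <- results %>%
  filter(!(state %in% c('AK', 'HI')))
```

This form is handy when the list of values to keep or drop gets longer than two or three.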

Saving data

Saving a dataset is easy. Just as there is a st_read function, there is an st_write function. We tell it the dataset to save, and the filename to use, and that’s pretty much it.

The only complication is that if the file already exists, we also have to tell it that it is OK to delete the existing file, using the delete_dsn parameter (‘dsn’ stands for ‘data source name’).

So here goes with delete_dsn set to TRUE just in case you end up running this command again later.

st_write(lower48, 'lower48.gpkg', delete_dsn = TRUE)
## Deleting source `lower48.gpkg' using driver `GPKG'
## Writing layer `lower48' to data source `lower48.gpkg' using driver `GPKG'
## Writing 3087 features with 10 fields and geometry type Multi Polygon.

That’s it. You should find that there is now a new geopackage file in the folder you are working in called lower48.gpkg. We can also save to other data formats. For example

st_write(lower48, 'lower48.geojson', driver = 'GeoJSON', delete_dsn = TRUE)
## Deleting source `lower48.geojson' using driver `GeoJSON'
## Writing layer `lower48' to data source `lower48.geojson' using driver `GeoJSON'
## Writing 3087 features with 10 fields and geometry type Multi Polygon.

will save a GeoJSON file.
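If you want to reassure yourself that the save worked, you can read the file straight back in (a quick round-trip check, nothing more):

```r
# round-trip check: read the saved geopackage back in
lower48_check <- st_read('lower48.gpkg')
nrow(lower48_check)  # should match nrow(lower48), i.e. 3087
```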

Grouping data

The lower48 dataset includes an attribute state which tells us the US state in which each county is found. We can use this information to make a new dataset by dissolving counties together based on the value of this attribute. In ESRI Arc products you may have seen this already as a ‘dissolve’ operation. It would be nice if exactly that function were available in sf, but instead we use the dplyr function group_by. Again we use pipes:

states <- lower48 %>%
  group_by(state) %>%
  summarise(across(where(is.numeric), sum)) %>%
  as.data.frame() %>%
  st_as_sf()

The last two lines here are probably not required, but are a workaround for a problem reported on some combinations of package versions.

Here we pass the lower48 dataset to the group_by function, which uses the state attribute to group counties. The second pipe sends this result to the summarise function, which uses across with a where clause to find the numeric attributes (those for which is.numeric returns TRUE) and combines their values within each group using sum. All this gives us a state-level dataset with population and vote counts summed to give appropriate results for the states.
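A quick sanity check on the dissolve (a sketch; the exact row count depends on whether DC is present in your data) is to count the rows and compare one state’s total against the sum over that state’s counties:

```r
nrow(states)  # one row per state (plus DC, if present)

# the votes total for California in states should equal the
# sum of votes over California's counties in lower48
states %>% filter(state == 'CA') %>% pull(votes)
lower48 %>% filter(state == 'CA') %>% pull(votes) %>% sum()
```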

tm_shape(states) +
  tm_polygons(col = 'population')