Overview
We cover a lot of ground in this lab, including handling simple
spatial data in R and the options available for mapping with
the tmap package. Keep in mind that you can refer back to
last week’s material if you need to refresh
your memory of anything we did previously.
The instructions this week are in several parts, to make it a bit easier to follow.
You may want to take a quick look at the last section, which details the assignment, so you know what is expected of you. Then come back to the top of the page and work through things in order, learning about the various tools you can use to accomplish what is asked of you.
Some of this week’s material may be hard to follow at first, and some
of it won’t come together completely until I explain the pipe
(%>%) operator in class (you won’t have to wait long for
this), but bear with me and all will become clear (the relevant lectures
are already available).
As explained last week, you should set up an RStudio project for this week’s work to make sure you can keep track of everything.
Videos
Video support for this week’s lab is here
Introducing the data
Libraries
First you need to load the sf library for handling
spatial data
library(sf)
tmap for mapping
library(tmap)
and dplyr for data wrangling
library(dplyr)
If any of these are not available on your installation, then install them with the Tools - Install Packages… tool from the main menu.
Data
The data for this week are election results from the US Presidential Election of 2016 (the one that gave us President Donald J. Trump). I’m sticking with that election because the data are more interesting than those from the more recent 2020 election.
You will find the data in this geopackage file. Download it and save it to an accessible folder on the computer you are working on. Then create a new RStudio project in that folder, as you learned last week.
Then you can read the data using the st_read
function
results <- st_read('election.gpkg')
## Reading layer `election' from data source
## `/Users/osullid3/Documents/teaching/Geog315/_labs/week-03/election.gpkg'
## using driver `GPKG'
## Simple feature collection with 3092 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -2775548 ymin: -170310 xmax: 2256204 ymax: 3165691
## Projected CRS: NAD27 / Conus Albers
The data are at the county level for the US (over 3000 counties of
wildly varying populations), with numbers of votes recorded for the two
major parties—Democrat (dem) and Republican
(gop)—along with some other minor parties: Libertarian
(lib), Green (grn), Evan McMullin
(una) (an independent conservative candidate who only
really stood in Utah), and ‘Other’ (oth).
Inspect the data
To make sure all is as it should be, take a look at the first few
rows of the data using the head function
head(results)
## Simple feature collection with 6 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 761051.8 ymin: 828012.4 xmax: 1026128 ymax: 1287748
## Projected CRS: NAD27 / Conus Albers
## state name population votes dem gop grn lib una oth
## 1 AL Autauga 54571 24661 5908 18110 105 538 0 0
## 2 AL Baldwin 182265 94090 18409 72780 453 2448 0 0
## 3 AL Barbour 27457 10390 4848 5431 18 93 0 0
## 4 AL Bibb 22915 8748 1874 6733 17 124 0 0
## 5 AL Blount 57322 25384 2150 22808 89 337 0 0
## 6 AL Bullock 10914 4701 3530 1139 10 22 0 0
## geom
## 1 MULTIPOLYGON (((892112.8 11...
## 2 MULTIPOLYGON (((780232.6 94...
## 3 MULTIPOLYGON (((1026128 105...
## 4 MULTIPOLYGON (((845034.9 11...
## 5 MULTIPOLYGON (((871035.6 12...
## 6 MULTIPOLYGON (((988752.8 10...
[By the way, there is also a tail function.]
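For example, to see just the last three rows (the n argument works the same way for head):
tail(results, n = 3)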
You can also see the dataset by clicking on it in the Environment tab of RStudio. If you want to make sure the spatial aspects are working OK, then try
results %>%
select(population) %>%
plot()
OK… the goal eventually will be to make a map of the election results. Here, I’ll explain the variables in the dataset.
The spatial units are US counties. Each county is in a particular
state and has a name. This information is contained in the
state and name attributes. The states each
have a two-letter abbreviation, as shown on this
map, so for example, California is ‘CA’. Not all counties have
unique names, so we need the state names to uniquely identify them. For
example
results %>%
filter(name == 'Washington')
## Simple feature collection with 32 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -2118948 ymin: 774485.4 xmax: 2256204 ymax: 2845010
## Projected CRS: NAD27 / Conus Albers
## First 10 features:
## state name population votes dem gop grn lib una oth
## 1 AL Washington 17581 8492 2366 6031 22 73 0 0
## 2 AR Washington 203065 78153 32296 40418 1275 3280 0 884
## 3 CO Washington 4814 2708 289 2284 11 81 14 29
## 4 DC Washington 601723 280272 260223 11553 3995 4501 0 0
## 5 FL Washington 24896 11144 2261 8630 21 191 0 41
## 6 GA Washington 21187 8422 4187 4138 0 97 0 0
## 7 ID Washington 10198 4454 776 3283 26 138 175 56
## 8 IL Washington 14716 7376 1446 5566 75 289 0 0
## 9 IN Washington 28262 11333 2636 8206 0 491 0 0
## 10 IA Washington 21704 10824 3938 6170 79 458 92 87
## geom
## 1 MULTIPOLYGON (((745944.3 98...
## 2 MULTIPOLYGON (((182541.1 14...
## 3 MULTIPOLYGON (((-569757.9 1...
## 4 MULTIPOLYGON (((1622264 191...
## 5 MULTIPOLYGON (((1009201 888...
## 6 MULTIPOLYGON (((1222984 121...
## 7 MULTIPOLYGON (((-1599544 25...
## 8 MULTIPOLYGON (((591785.5 17...
## 9 MULTIPOLYGON (((869100.4 17...
## 10 MULTIPOLYGON (((374443.1 20...
will return a list of 32 (yes, 32!) counties called Washington (although only the first 10 are shown). There are also several Jefferson counties and a few other duplicates besides. (Fun Simpsons fact: there are as many as 36 places called Springfield in the US.)
Of more relevance to the assignment are the vote counts and population information contained in the other variables, as follows:
- population: an estimate of the county population. Note that these vary widely; use summary(results$population) to see just how much
- votes: the total number of votes cast in the presidential election in the county
- dem: votes cast for the Democratic Party candidate (i.e. Hillary Clinton)
- gop: votes cast for the Republican Party (GOP) candidate (i.e. Donald Trump)
- grn, lib, una, oth: votes cast respectively for the Green, Libertarian, Independent, and ‘Other’ candidates - apart from Utah, these numbers are rather low across the country
In the assignment it is important to consider whether these raw vote counts are what you want to map (and why), or whether it makes more sense to create additional variables, such as the vote share of the different candidates, the margin of victory, or the difference between various candidates’ votes. To quickly map any one of these attributes, or any new attribute that you make, use the following (you can easily change the variable or the colours), as we did last week
tm_shape(results) +
tm_polygons(col = 'dem', palette = 'Blues')
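As a sketch of how you might create such additional variables, the dplyr mutate function adds new attributes calculated from existing ones (the column names dem_share, gop_share, and margin here are just my suggestions, not part of the supplied data):

results <- results %>%
  mutate(dem_share = dem / votes,        # Democratic share of the votes cast
         gop_share = gop / votes,        # Republican share of the votes cast
         margin = (gop - dem) / votes)   # positive where the Republicans won

You could then map one of the new attributes with, for example, tm_shape(results) + tm_polygons(col = 'margin', palette = '-RdBu') (the minus sign reverses the palette so that Republican wins show red).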
Data preparation
The spatial data we just loaded has some quirks, so in this document,
I explain how you can tidy it up to make it more suitable for the
assignment. To do this, we will be using the dplyr library
which provides functions for selecting, filtering and combining data in
various ways. This week we’ll just be seeing these in action, but in
later weeks we’ll be looking more closely at their operation.
The tasks covered in this document are
- Filtering data to remove Alaska and Hawai’i
- Saving data to spatial data files to make the filtering we did permanent
- Grouping data to make a states dataset
- Adding new attributes to a dataset using mutate
Filtering data
The data in the election.gpkg file are a bit odd in that
Alaska and Hawai’i appear where Mexico is in real life. To show how easy
it is to select data in R, we will use the filter
function from the dplyr package to get rid of them. (You
might not want to do this in your map; this is just to demonstrate
filtering.)
The variable we need to do the selection on is state.
Say we wanted to map just California; we can do this by filtering the
results dataset and then plotting it:
results %>%
filter(state == 'CA') %>%
plot()
How does this work? You read the command above something like “take
the results dataset, then filter it,
then plot it”. The pipe
%>% directs whatever comes before it into whatever
function comes after it. The filter function specifies a
condition for which cases to retain in the filtered dataset.
Here we want all those cases with the state value ‘CA’
(i.e., California). If instead we wanted all the states that are NOT
California, we can use the not-equals operator,
!=
results %>%
filter(state != 'CA') %>%
plot()
## Warning: plotting the first 9 out of 10 attributes; use max.plot = 10 to plot
## all
So, to get rid of Alaska and Hawai’i we can do this:
lower48 <- results %>%
filter(state != 'AK' & state != 'HI')
Here we’ve combined the requirement that the state not be Alaska (AK)
and not be Hawai’i (HI) using the & operator.
This works fine and there is nothing wrong with this way of doing
things.
This time, because we might want to retain the result, we assign the
output to a new variable called lower48 using
<-.
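As an aside, another equivalent way you may come across uses R’s %in% operator, which tests whether a value belongs to a set. This is just for reference; the approach above is fine:

lower48 <- results %>%
  filter(!(state %in% c('AK', 'HI')))   # keep counties whose state is neither AK nor HI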
We could also use the pipe operation to filter the data twice like this:
lower48 <- results %>%
filter(state != 'AK') %>%
filter(state != 'HI')
Which will give us the same result:
plot(lower48)
## Warning: plotting the first 9 out of 10 attributes; use max.plot = 10 to plot
## all
Because filtering data is not our primary focus right now, we will worry more about exactly how this works in later lectures and assignments.
For now, it is enough to know that we can do this kind of filtering
to make new datasets, that the filter function is how we do
it, and that the pipe operator %>% is a neat
way to do it.
Again, the way to read the command above is “start with
results as the input, pipe it into the first
filter (which throws away Alaska, AK), then
pipe it into a second filter (which throws away Hawai’i,
HI)”. The pipe operator is central to the R
tidyverse.
Either way, we now have a dataset with only the contiguous (‘lower 48’) states of the US.
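As a quick sanity check (just a sketch; the counts come from the outputs shown above), you can compare the number of counties before and after filtering:

nrow(results)   # 3092 counties in the full dataset
nrow(lower48)   # 3087 counties once Alaska and Hawai'i are removed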
Saving data
Saving a dataset is easy. Just as there is a st_read
function, there is an st_write function. We tell it the
dataset to save, and the filename to use, and that’s pretty much it.
The only complication is that if the file already exists, then we
also have to tell it that it is OK to delete the existing file, using
the delete_dsn parameter (dsn stands for
‘data source name’).
So here goes with delete_dsn set to TRUE
just in case you end up running this command again later.
st_write(lower48, 'lower48.gpkg', delete_dsn = TRUE)
## Deleting source `lower48.gpkg' using driver `GPKG'
## Writing layer `lower48' to data source `lower48.gpkg' using driver `GPKG'
## Writing 3087 features with 10 fields and geometry type Multi Polygon.
That’s it. You should find that there is now a new geopackage file in
the folder you are working in called lower48.gpkg. We can
also save to other data formats. For example
st_write(lower48, 'lower48.geojson', driver = 'GeoJSON', delete_dsn = TRUE)
## Deleting source `lower48.geojson' using driver `GeoJSON'
## Writing layer `lower48' to data source `lower48.geojson' using driver `GeoJSON'
## Writing 3087 features with 10 fields and geometry type Multi Polygon.
will save a GeoJSON file.
Grouping data
The lower48 dataset includes an attribute
state which tells us the US state in which each county is
found. We can use this information to make a new dataset by
dissolving counties together based on the value of this
attribute. In ESRI Arc products you may have seen this already
as a ‘dissolve’ operation. It would be nice if exactly that function was
available in sf but we actually use the dplyr
function group_by instead. Again we use pipes:
states <- lower48 %>%
group_by(state) %>%
summarise(across(where(is.numeric), sum)) %>%
as.data.frame() %>%
st_as_sf()
The last two lines here are probably not required, but are a workaround for a problem reported on some combinations of package versions.
Here we pass the lower48 dataset to the
group_by function, which will use the state
attribute to group counties. The second pipe sends this result to the
summarise function, which uses across with a
where clause to pick out the numeric attributes (those for
which is.numeric returns TRUE) and combines the
values in each group using the sum function. All this gives
us a state-level dataset with population and votes data summed to give
appropriate results for the states.
tm_shape(states) +
tm_polygons(col = 'population')
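A quick way to check that the dissolve worked (just a sketch, not required for the assignment) is to confirm that the state totals add up to the same overall number of votes as the county-level data:

sum(states$votes) == sum(lower48$votes)   # should be TRUE if the group-and-sum worked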