Worksheet 11

Published

April 7, 2026

Packages

library(ggbiplot)
library(tidyverse)
library(tmaptools)
library(leaflet)
library(conflicted)
conflicts_prefer(dplyr::summarize)
conflicts_prefer(dplyr::filter)

Rugby league teams by location

There are 37 teams that play professional or semi-professional rugby league in Europe.¹ These are listed in the file at http://ritsokiguess.site/datafiles/rugby-league-teams.csv, including the name of each team, its location, and the league in which they play (from Super League, best, to League One, worst). Our aim is to make a map of the locations of the teams, to see what we can learn about where the teams tend to be from.

Read in and display (some of) the data.

my_url <- "http://ritsokiguess.site/datafiles/rugby-league-teams.csv"
teams <- read_csv(my_url)

Rows: 37 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): team, location, league

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

teams

Look up the latitudes and longitudes of the location where each team plays.

This is geocode_OSM, which needs some disentangling afterwards. Remember that geocode_OSM returns three things, so you need to put it in a list first. I saved this back into the same dataframe:

teams %>% 
  rowwise() %>% 
  mutate(latlong = list(geocode_OSM(location))) -> teams

This is the slowest part, so it’s best to save the answer so that you only do it once.

This is where we are at now:

teams

We want to take a look at the things in what I called latlong, which will be some kind of unnest. My take here is that I want to keep one row for each team, so I am going to unnest wider:

teams %>% unnest_wider(latlong)

The latitiudes and longitudes we want are in coords, so unnest_wider again:

teams %>% unnest_wider(latlong) %>% 
  unnest_wider(coords) -> teams
teams

Now we have, over on the right, columns x (longitudes) and y (latitudes). These places are all in the northern hemisphere, so the latitudes are positive, but some of the longitudes are negative (west of the Greenwich meridian) and some are positive (east of these). These are the right sign conventions for drawing maps.

Draw a map showing where these 37 teams play.

This is where leaflet comes in:

leaflet(data = teams) %>%
  addTiles() %>% 
  addCircleMarkers(lng = ~x, lat = ~y)

As an option, you can use the default markers, which look like Google map pins:

leaflet(data = teams) %>%
  addTiles() %>% 
  addMarkers(lng = ~x, lat = ~y)

I think I like the circles better, because the pins hide behind each other and you don’t know how dense they are until you zoom in.

Talking of zooming in, this works like a google map: click and drag to centre the map, and use the mouse wheel to zoom in or out (or click the + or - top left).

Where are most of the teams found?

Most of the teams are in a band across northern England from about Liverpool to Hull. There are some teams elsewhere: around the coast of north-west England, in Wales, in London (actually two teams²), in the south of France, and in other odd places. But the vast majority of the teams are in this narrow band.

What can you find out about why most of the teams are located where they are?

See what Google has to say. For example, I searched for “why are there so many rugby league teams in the north of England”, and found this article that tells the story pretty well. If you have played or watched rugby, the version of it you probably saw is called Rugby Union. It has 15 players a side, and “scrums” in which a mass of players fight for the ball. Rugby league has only 13 players a side and no scrums, and is a faster, more open game. The reason there are two very similar but distinct sports has to do with being paid to play the game. Rugby union players were traditionally “gentlemen” who had well-paying jobs or who were independently wealthy, and so rugby union was for a long time (even into my lifetime) an amateur game where players did not get paid to play. In the north of England, rugby was always a working-class game, and the best players had to take time off work to train and play, for which they could not at first be paid. This was the reason that rugby league became a separate game: it has been professional since the start. (Rugby union is now professional too.)

So, “because the northern players were working class, and they wanted to get paid for playing.”

(As far as England is concerned, there are rugby union teams all over the country, but mostly only rugby league teams in the north.)

Also part of the rugby league heartland was Cumbria, at the top left of the map (the three teams are the long-standing ones of Barrow, Whitehaven and Workington). The other teams are attempts to foster interest in rugby league in other parts of the country. One of the newest teams is Cornwall, at the bottom left of England. (The south-west of France has also been a place where rugby league is played, for reasons I know not.)

Re-draw your map, but now colouring the points according to which league the team plays in.

Borrow the idea from the clusters in class: make a colour palette first. Note that color in here has to be thus spelled:

pal <- colorFactor("Set1", teams$league)

Then use this with a color in your markers line:

leaflet(data = teams) %>%
  addTiles() %>% 
  addCircleMarkers(lng = ~x, lat = ~y, color = ~pal(league))

Go back to the data to determine which league is which. They are, from highest to lowest level of play: Super League (green), Championship (red), League 1 (blue).³ Apart from those two teams in the south of France, all the Super League teams are in that band across northern England, and most of the teams in non-traditional places (for Rugby League) are in the lowest-level League 1. Thus, attempts to grow the sport in other places have not been very successful.

Intoxicant use according to gender and race

In a survey, 2,276 high-school students were classified according to whether or not they have ever used alcohol, cigarettes, or marijuana (responses). In the survey, each student’s race and gender (as they reported them) was also recorded (explanatory). The data are in http://ritsokiguess.site/datafiles/intoxicant.csv. The columns are labelled by the initial letter of each of these, with a column count that says how many students fell into that combination of categories.

Read in and display some of the data. How do you know you have the correct total number of students?

The usual, to kick off with:

my_url <- "http://ritsokiguess.site/datafiles/intoxicant.csv"
use <- read_csv(my_url)

Rows: 32 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): a, c, m, r, g
dbl (1): count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

use

The first five columns are respectively whether the students have ever used alcohol, cigarettes, or marijuana, then their race (classified as “white” or “other”), then their reported gender (classified as “male” or “female” only).

There are \(32 = 2^5\) rows, one for each of the two levels of each of the five categorical variables. But each row contains a lot of students, namely, the number in count, so to find out how many students here are altogether, sum up the values in count:

use %>% summarize(total = sum(count))

which checks with the question.

Extra: the tidyverse has a function amusingly called uncount that does the opposite of what count does:

use %>% uncount(count)

This produces one row for each individual student (you will have to scroll down a long way to get past the 405 rows of white female students who have used all three intoxicants), and hence 2,276 rows altogether.

Fit a log-linear model with up to two-way associations to these data. To do this, use (a+c+m+r+g)^2 on the right side of your model formula (instead of the a*c*r*m*g that you were probably expecting). Run a suitable drop1 on this model.

use.1 <- glm(count ~ (a+c+m+r+g)^2, data = use, family = "poisson")
drop1(use.1, test = "Chisq")

You see that in the drop1 output are all the two-way interactions (“associations” for this kind of model). This is what ^ does in a model formula, and is why we had to use I(x^2) to add an \(x\)-squared term in a regression, meaning “literally \(x\)-squared, not any kind of an interaction”.

The reason for doing it this way here is that if you start with a*c*r*m*g, you will have to eliminate the five-way interaction, then all the four-way interactions, then all the three-way interactions, before you even get down to business, where “remove” means “remove and re-fit, one by one”. I didn’t want to put you through that.

Build a better model. Why did you stop where you did?

Just as in regression, remove the least significant term, re-fit, then repeat until everything is significant and there is nothing left to drop.

First, remove c:r (P-value 0.51), then see what to do next:

use.2 <- update(use.1, . ~ . - c:r)
drop1(use.2, test = "Chisq")

Next is r:g, P-value 0.37:

use.3 <- update(use.2, . ~ . - r:g)
drop1(use.3, test = "Chisq")

Then c:g, P-value 0.33:

use.4 <- update(use.3, . ~ . - c:g)
drop1(use.4, test = "Chisq")

Those P-values are looking pretty small now. m:r, with P-value 0.084, seems to be the last one to remove:

use.5 <- update(use.4, . ~ . - m:r)
drop1(use.5, test = "Chisq")

and so it proves. Everything remaining is significant, with the largest remaining P-value being 0.019 and the others being a lot smaller than that. Thus, this is where we stop.

Extra: we have a lot of data here (over 2000 observations), so the effects, though significant, might not all be very big. This is something to bear in mind in the next question.

For each of your significant associations, draw a graph to explore them, and say what you conclude. Note that there is a logical distinction between associations that contain both a response variable and an explanatory one, and those that contain two variables of the same type.

The point of the log-linear analysis is to see which associations are worth looking at. If the association is not significant, there is no effect and that association does nothing to explain what is going on.

I begin by classifying the remaining associations by what type of variables they are:

a:c: response and response
a:m: response and response
a:r: response and explanatory
a:g: response and explanatory
c:m: response and response
m:g: response and explanatory

What we are looking for in this kind of analysis is how a response variable depends on explanatory variable(s), so the third, fourth, and sixth associations above are the most interesting. Let’s focus on them first.

In drawing a graph of a response and an explanatory, what you want to be able to say is “for each category of the explanatory variable, what happens to the response?”, so for those graphs you want the explanatory as x and the response as fill. So I’ll start with these. I think the best graph is stacked bars but scaled to have the same height so that you can compare proportions.⁴ The (relative) height of the bars is something we already have (in count), so this is geom_col with count in y, and not geom_bar which would count things for us.

In a:r, alcohol and race, the former is a response, so it goes as fill:

ggplot(use, aes(x = r, y = count, fill = a)) + geom_col(position = "fill")

The right-hand blue piece is slightly bigger, so if a student is white, they are more likely to have used alcohol than if they were a student of some other race. (This difference, though small, is significant, which is why we are looking at it.)

a:g is done the same way as a:r:

ggplot(use, aes(x = g, y = count, fill = a)) + geom_col(position = "fill")

If a student is female, they are very slightly more likely to have used alcohol than if they are male. This tiny difference is in fact significant, though the P-value is 0.019, which is not very small. Because there is so much data, significant effects can be small, and this appears to be one of those.

m:g is done the same way again:

ggplot(use, aes(x = g, y = count, fill = m)) + geom_col(position = "fill")

If a student is male, they are (a little) more likely to use marijuana than if the student is female. Again, a small but significant difference.

When you have two variables of the same type (in this problem, both responses), it doesn’t matter which one is x and which one is fill, as long as your explanation matches up: for each category of the thing in x, what happens to the thing in fill?

a:c:

ggplot(use, aes(x = a, y = count, fill = c)) + geom_col(position = "fill")

Students that have used alcohol are more likely to have used cigarettes as well (and the ones that haven’t used alcohol probably haven’t used cigarettes either). That is to say, alcohol and cigarette use go together. This is interesting, but it is a sort of secondary conclusion; you might imagine that the survey people were really interested in a question like “what kinds of young people tend to use alcohol or cigarettes or marijuana?” and this graph doesn’t help in answering that question.

Or, if you prefer:

ggplot(use, aes(x = c, y = count, fill = a)) + geom_col(position = "fill")

and the relevant conclusion is that if a student has used cigarettes, they have almost certainly used alcohol as well (and if they haven’t used cigarettes, they are much less likely to have used alcohol, even though the chance is not small).

The implication here is that these patterns are consistent over race. If they were not, we would have an a:c:r interaction, implying that race affects the combination of alcohol and cigarette use. But see the next question.

a:m are also both responses:

ggplot(use, aes(x = a, y = count, fill = m)) + geom_col(position = "fill")

Almost all the students who have not used alcohol have not used marijuana either, but of the students that have used alcohol, about half of them have used marijuana as well. Once again, students who have used one are more likely to have used the other as well.

Once again, the implication is that this pattern holds over all genders and races, otherwise we would have had a higher-order interaction in the model.⁵

Finally, c:m:

ggplot(use, aes(x = c, y = count, fill = m)) + geom_col(position = "fill")

A student who has not used cigarettes will most likely not have used marijuana either, but a student who has used cigarettes has a slightly better than even chance of having used marijuana: once again, the categorical-variable version of a “positive correlation”.

I wouldn’t give you a question this long on an exam, but on a worksheet, there is no problem with giving you plenty of practice.

We can also use step to do the model-building (rather than removing terms one by one). Starting from all three-way interactions, run step on this model, saving the result, and then run drop1 on that result. Is everything remaining significant? (Hint: copy and paste your code from question 8, and change the 2 to a 3.)

Following the hint(s). The output is rather long, but (i) you are not handing this in, and (ii) we have no problems with letting R do the work:

use.6 <- glm(count ~ (a+c+m+r+g)^3, data = use, family = "poisson")
use.7 <- step(use.6)

Start:  AIC=193.51
count ~ (a + c + m + r + g)^3

        Df Deviance    AIC
- c:m:r  1   5.2820 191.52
- m:r:g  1   5.3000 191.54
- a:m:g  1   5.3924 191.63
- a:m:r  1   5.4349 191.67
- a:c:g  1   5.5868 191.82
- a:c:m  1   5.6749 191.91
- c:r:g  1   6.4173 192.66
- c:m:g  1   6.7775 193.02
<none>       5.2720 193.51
- a:r:g  1   7.8942 194.13
- a:c:r  1   7.9474 194.19

Step:  AIC=191.52
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:m:r + a:m:g + 
    a:r:g + c:m:g + c:r:g + m:r:g

        Df Deviance    AIC
- m:r:g  1   5.3091 189.55
- a:m:g  1   5.4028 189.64
- a:m:r  1   5.4545 189.69
- a:c:g  1   5.5977 189.84
- a:c:m  1   5.6939 189.93
- c:r:g  1   6.4173 190.66
- c:m:g  1   6.7824 191.02
<none>       5.2820 191.52
- a:r:g  1   7.8976 192.14
- a:c:r  1   8.0316 192.27

Step:  AIC=189.55
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:m:r + a:m:g + 
    a:r:g + c:m:g + c:r:g

        Df Deviance    AIC
- a:m:g  1   5.4343 187.67
- a:m:r  1   5.4746 187.71
- a:c:g  1   5.6324 187.87
- a:c:m  1   5.7215 187.96
- c:r:g  1   6.5189 188.76
- c:m:g  1   6.8129 189.05
<none>       5.3091 189.55
- a:r:g  1   8.0975 190.34
- a:c:r  1   8.1046 190.34

Step:  AIC=187.67
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:m:r + a:r:g + 
    c:m:g + c:r:g

        Df Deviance    AIC
- a:m:r  1   5.5688 185.81
- a:c:g  1   5.7041 185.94
- a:c:m  1   5.8192 186.06
- c:r:g  1   6.6417 186.88
- c:m:g  1   6.8615 187.10
<none>       5.4343 187.67
- a:r:g  1   8.2153 188.45
- a:c:r  1   8.2529 188.49

Step:  AIC=185.81
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:r:g + c:m:g + 
    c:r:g

        Df Deviance    AIC
- a:c:g  1   5.8372 184.07
- a:c:m  1   5.9021 184.14
- c:r:g  1   6.7800 185.02
- c:m:g  1   7.0006 185.24
<none>       5.5688 185.81
- m:r    1   7.7021 185.94
- a:r:g  1   8.3839 186.62
- a:c:r  1   8.6868 186.93

Step:  AIC=184.08
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:r:g + c:m:g + c:r:g

        Df Deviance    AIC
- a:c:m  1   6.1515 182.39
- c:r:g  1   7.1650 183.40
- c:m:g  1   7.4487 183.69
<none>       5.8372 184.07
- m:r    1   7.9676 184.21
- a:r:g  1   8.9101 185.15
- a:c:r  1   9.1890 185.43

Step:  AIC=182.39
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:r + a:r:g + c:m:g + c:r:g

        Df Deviance    AIC
- c:r:g  1    7.475 181.71
- c:m:g  1    7.791 182.03
<none>        6.151 182.39
- m:r    1    8.319 182.56
- a:r:g  1    9.214 183.45
- a:c:r  1    9.529 183.77
- a:m    1   97.016 271.25

Step:  AIC=181.71
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:r + a:r:g + c:m:g

        Df Deviance    AIC
- c:m:g  1    9.215 181.45
<none>        7.475 181.71
- a:r:g  1    9.476 181.71
- m:r    1    9.613 181.85
- a:c:r  1   11.589 183.83
- a:m    1   98.342 270.58

Step:  AIC=181.45
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    c:g + m:r + m:g + r:g + a:c:r + a:r:g

        Df Deviance    AIC
- c:g    1    10.31 180.54
<none>         9.22 181.45
- a:r:g  1    11.26 181.50
- m:r    1    11.35 181.59
- a:c:r  1    13.39 183.63
- m:g    1    19.04 189.28
- a:m    1    99.62 269.86
- c:m    1   506.21 676.45

Step:  AIC=180.54
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    m:r + m:g + r:g + a:c:r + a:r:g

        Df Deviance    AIC
- a:r:g  1    12.24 180.48
<none>        10.31 180.54
- m:r    1    12.44 180.68
- a:c:r  1    14.36 182.60
- m:g    1    19.22 187.46
- a:m    1   100.73 268.97
- c:m    1   506.39 674.63

Step:  AIC=180.48
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    m:r + m:g + r:g + a:c:r

        Df Deviance    AIC
- r:g    1    13.05 179.29
<none>        12.24 180.48
- m:r    1    14.48 180.71
- a:c:r  1    16.32 182.56
- a:g    1    17.53 183.77
- m:g    1    21.32 187.56
- a:m    1   102.60 268.84
- c:m    1   508.34 674.58

Step:  AIC=179.29
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r + 
    m:r + m:g + a:c:r

        Df Deviance    AIC
<none>        13.05 179.29
- m:r    1    15.15 179.39
- a:c:r  1    17.13 181.37
- a:g    1    18.56 182.80
- m:g    1    21.95 186.19
- a:m    1   103.43 267.67
- c:m    1   509.16 673.39

drop1(use.7, test = "Chisq")

As you see, the process is rather lengthy, which is why I didn’t have you do it by hand.

There is one term in use.7, m:r, that is not significant (P-value 0.15). You could remove it if you wish.

In your final model from the previous question, are there any significant terms that you did not see previously? If so, in each case draw a suitable graph and say what it means.

There is one remaining three-way interaction a:c:r. One way of interpreting this is that race is associated with the alcohol-cigarettes combination.

With three variables, we have to use x, fill and facets. I think the one explanatory variable r belongs in facets (but you can experiment and see which you prefer):

ggplot(use, aes(x = a, y = count, fill = c)) + geom_col(position = "fill") + 
  facet_wrap(~ r)

The pattern in the two facets is more or less the same: if a student has used alcohol, they are more likely to have used cigarettes as well. Perhaps the reason for the interaction term being significant is that the no-no pattern is a bit more pronounced for the white students than it is for students of other races: if a white student has not used alcohol (left bar in the right facet), they are more likely to have not used cigarettes either, compared to a student of another race. (The three-way association is only just significant, with a P-value of 0.043, so we would not expect the effect to be very large, and indeed it is not.)

Footnotes

Rugby league is also played in Australia, traditionally around the city of Sydney, and in places in the Pacific like New Zealand and Papua New Guinea.↩︎
As I re-read this in 2025, there is now only one team in London, but there is a team in Goole (the Goole Vikings). You might wish to find that on your map.↩︎
There are actually two teams in Hull (green), and two in London (one red and one blue, which is why London appears to be purple.)↩︎
This is the only kind of “stacked bars” I like.↩︎
Except that we didn’t fit any higher-order interactions in the first place, but see the next question.↩︎