STAD29 Assignment 3

Packages

library(tidyverse)
library(nnet)
library(marginaleffects)

Latin-American voters in Chicago

In a city election in Chicago, three candidates are running, Garcia, Martinez, and Yanez. In the 22nd Ward, there are 27 precincts (voting districts). In each precinct, the number of votes cast for each candidate was recorded (as you would expect), but also each voter was asked whether they identified as Latin-American or not (there are a lot of Latin-Americans in this part of Chicago). Does the probability that a voter votes for each candidate in a precinct depend on what fraction of voters are Latin-American in that precinct? We will use the data in http://datafiles.ritsokiguess.site/chicago-election.csv to address this question. The columns in the dataset are as follows:

  • pr: the precinct number (identifier)
  • latv: number of voters in that precinct identifying as Latin-American
  • nonlv: number of voters in that precinct not identifying as Latin-American
  • turnout: total number of voters in that precinct
  • Garcia, Martinez, Yanez: number of votes cast for those candidates.

Note that turnout is the sum of latv and nonlv (each voter identified as Latin-American or not), but that turnout is typically more than the sum of Garcia, Martinez, and Yanez because some voting ballots were “spoiled” (that is, not completed in a way that the counter could see who was being voted for).

  1. (1 point) Read in and display some of the data.

By now, nothing to it:

my_url <- "http://datafiles.ritsokiguess.site/chicago-election.csv"
election <- read_csv(my_url)
Rows: 27 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): pr, latv, nonlv, turnout, Garcia, Martinez, Yanez

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
election
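
Extra: a quick check of the two facts in the description (turnout is latv plus nonlv exactly, and is typically more than the sum of the three candidates' vote counts because of spoiled ballots). This is just a sketch of a sanity check, not something the assignment asks for:

```r
election %>%
  summarize(turnout_matches = all(turnout == latv + nonlv),
            total_spoiled   = sum(turnout - (Garcia + Martinez + Yanez)))
```

If turnout_matches comes out TRUE and total_spoiled is positive, the data are behaving as described.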

Extra: this is more or less the data in E4.10 in the SenSrivastava package:

election0 <- SenSrivastava::E4.10
election0

However, the column names are in ALL CAPS, which I find REALLY ANNOYING to type, so I thought I would fix that for you. The function clean_names from the janitor package creates friendly and R-legal column names:

election0 %>% 
  janitor::clean_names() 

but, having done that, it seemed disrespectful to have the candidate names all in lowercase. The simple way to fix this is to use rename on those three columns one at a time:

election0 %>% 
  janitor::clean_names() %>% 
  rename(Garcia = garcia, Martinez = martinez, Yanez = yanez) -> election0
election0

which is what I saved for you.

  2. (2 points) Create a new column called pct_latin which is the proportion of Latin-American voters in each precinct. Save the resulting dataframe, and display (only) your new column from it.

This is mutate. You can either divide latv by turnout, or you can divide it by latv plus nonlv. The answer is the same either way:

election %>% mutate(pct_latin = latv / turnout) -> election
election %>% select(pct_latin)

The display of the column is to allow the grader to check your work. If you display the whole dataframe, the new column is at the end, and the grader may have to scroll across to see it. That will get very old very fast.

  3. (2 points) In order to fit a model, we need one column containing all the numbers of votes for all the candidates, and a second column saying which candidate those votes were for. Arrange your dataframe so that this is true, and save the result.

This is in fact exactly the same thing we did with the coal miners data in lecture (and you will see another similarity shortly): pivot the three columns of votes longer:

election %>% 
  pivot_longer(Garcia:Yanez, names_to = "candidate", values_to = "votes") -> election_long
election_long

  4. (3 points) Fit a suitable model, given the aim we are trying to address. Hint: compared to the coal-miners data in lecture, there is one similarity and one difference.

When we were looking at the coal-miners data in lecture, the three categories of disease were ordered (none, moderate, severe). But in this case, the candidates are not ordered in any way: they are just three names. That’s the difference from the coal miners. The similarity is that each row of the dataframe represents more than one thing: in this case, more than one vote, so that in the same way that we had weights in the model-fitting for the coal miners, so we do here as well.

The non-ordered response categories in candidate mean that we need to use multinom, predicting candidate from pct_latin with votes as weights:

election.1 <- multinom(candidate ~ pct_latin, data = election_long, weights = votes)
# weights:  9 (4 variable)
initial  value 6605.955692 
iter  10 value 5655.399876
final  value 5654.581887 
converged

This is all you need. As usual with this kind of model, you can look at the summary, but it doesn’t tell you much:

summary(election.1)
Call:
multinom(formula = candidate ~ pct_latin, data = election_long, 
    weights = votes)

Coefficients:
         (Intercept) pct_latin
Martinez   -1.954640  2.534511
Yanez      -2.358285  1.516940

Std. Errors:
         (Intercept) pct_latin
Martinez   0.1418353 0.2388857
Yanez      0.1839202 0.3124423

Residual Deviance: 11309.16 
AIC: 11317.16 

The Coefficients table tells us, roughly, that compared to the baseline Garcia, the probability of voting for either of the two other candidates starts out lower (when pct_latin is zero), but, again compared to Garcia, the probabilities of voting for these two candidates increase faster as pct_latin increases (especially for Martinez). We’ll see later how this plays out with the predictions.
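
Extra: if you want to see exactly how the Coefficients table turns into probabilities, you can do the softmax calculation by hand. Here is a sketch using the estimates shown above, for a precinct where pct_latin is 0.5:

```r
# log-odds of each candidate relative to the baseline Garcia,
# using the coefficients from summary(election.1):
b_martinez <- c(-1.954640, 2.534511)
b_yanez    <- c(-2.358285, 1.516940)
x <- c(1, 0.5)                       # intercept, pct_latin
eta <- c(Garcia   = 0,               # baseline
         Martinez = sum(b_martinez * x),
         Yanez    = sum(b_yanez * x))
round(exp(eta) / sum(exp(eta)), 3)   # about 0.587, 0.295, 0.118
```

This is the same calculation that predictions will do for us later, so the hand-computed numbers should sit between the predictions for pct_latin of 0.4 and 0.6.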

Extra: if you forget the weights, you are assuming that each candidate gets one vote in each precinct, which completely misses the idea that there may be a reason that some candidates get more votes in some precincts than others:

election.1x <- multinom(candidate ~ pct_latin, data = election_long)
# weights:  9 (4 variable)
initial  value 88.987595 
final  value 88.987595 
converged

As a result, the predictions (copying something we will do on the correct model later) make no sense:

new <- datagrid(model = election.1x, pct_latin = c(0.2, 0.4, 0.6, 0.8))
cbind(predictions(election.1x, new)) %>% 
  select(group, estimate, pct_latin) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Regardless of the proportion of Latin-American voters in each precinct, a voter is equally likely to vote for each of the candidates according to this model. This seems just a bit unlikely, and if it happens to you (later), it is definitely something to make you wonder whether you did the right thing. Especially if you go back to the data and work out how many votes were cast for each candidate altogether:

election_long %>% 
  group_by(candidate) %>% 
  summarize(vote_total = sum(votes))

it seems that at the very least, the probability of voting for Yanez should be smaller than for the other candidates.

Another (more difficult) way to get to this is to sum up the appropriate columns of the original dataframe:

election %>% 
  summarize(across(Garcia:Yanez, \(x) sum(x)))
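
Extra extra: another way to see what the weights are doing: with whole-number weights, the weighted fit is identical to fitting the unweighted model to a dataframe in which each row is repeated votes times. A sketch of that equivalence using tidyr's uncount (this gives the same coefficient estimates as election.1, just much more slowly, so the weights version is the one to use):

```r
election_long %>%
  uncount(votes) %>%                            # one row per vote
  multinom(candidate ~ pct_latin, data = .) %>%
  coef()                                        # same as for election.1
```
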

  5. (2 points) Demonstrate that there is a significant effect of pct_latin on the probabilities of voting for each candidate.

Your first thought is likely to be drop1, but if you try it, this happens:

drop1(election.1, test = "Chisq")
trying - pct_latin 
Error in if (trace) {: argument is not interpretable as logical

So what you have to do is to fall back on anova, comparing the fit of a model including pct_latin and one excluding it. This is the procedure I showed you second for the coal miners (because in that case drop1 worked), though it’s actually in the lecture notes first:

election.0 <- update(election.1, . ~ . -pct_latin)
# weights:  6 (2 variable)
initial  value 6605.955692 
final  value 5724.622504 
converged

To my mind, this is the best way of removing pct_latin from our model, but if you prefer to copy, paste, and edit, you have to find something to replace pct_latin with that indicates a model with no \(x\)-variables:

election.0a <- multinom(candidate ~ 1, data = election_long, weights = votes)
# weights:  6 (2 variable)
initial  value 6605.955692 
final  value 5724.622504 
converged

Then you run anova on your two models, smaller first, remembering that these models are maximum likelihood so your test is Chisq:

anova(election.0, election.1, test = "Chisq")

The P-value is basically zero, so the bigger model election.1 fits significantly better than the model election.0 with just an intercept. That is to say, we really do need to keep pct_latin in the model, because it has a significant effect on voting.
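
Extra: the chi-squared statistic in the anova table can be reproduced by hand from the final value lines printed as each model was fitted (those are negative log-likelihoods): twice their difference, with degrees of freedom equal to the difference in numbers of parameters (4 in the bigger model and 2 in the smaller, as the # weights lines show). A sketch:

```r
lr_stat <- 2 * (5724.622504 - 5654.581887)      # about 140.1
pchisq(lr_stat, df = 4 - 2, lower.tail = FALSE)  # P-value: essentially zero
```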

  6. (3 points) For values of pct_latin of 0.2, 0.4, 0.6, 0.8, obtain predicted probabilities of voting for each candidate, displayed in a way that clearly shows the influence of pct_latin on the probabilities.

There are three things to do:

  • get a dataframe new that contains the values you want to predict for
  • obtain the predictions, keeping only the relevant columns
  • rearrange the predictions to show the effect of pct_latin on the predicted probabilities.

These multi-stage questions may seem like a lot; if they do, take it one step at a time and then figure out what to do next.

First, get a dataframe new of values to predict for. Pretty much any way you have of getting this will work: as long as you have a column called pct_latin with the right values in it, you are good here, whether you use datagrid or tibble or tribble to get it. I went the first way:

new <- datagrid(model = election.1, pct_latin = c(0.2, 0.4, 0.6, 0.8))
new

It’s clearest to put the model first here, and you also need to name it so that datagrid can distinguish things to predict for from the model. It doesn’t matter here which way you do it because there is only one explanatory variable; if you had another one that you wanted to hold constant at its mean (so as to isolate the effect of pct_latin), datagrid would be the easiest way to make that work, but here there are no such problems. The other columns here are the most common candidate and the mean number of votes across all candidates and precincts, neither of which we need (they will be ignored).

Next, the predictions:

cbind(predictions(election.1, new))

Now that you have this, work out what you need to keep: the response category (candidate) in group, the predicted probabilities in estimate, and the value of pct_latin you were making predictions for:

cbind(predictions(election.1, new)) %>% 
  select(group, estimate, pct_latin)

Now, we want to compare the predicted probabilities for the same pct_latin value for each candidate, so we want to pivot this wider, using the names of the candidates (in group) as column names, filling the columns with the values in estimate:

cbind(predictions(election.1, new)) %>% 
  select(group, estimate, pct_latin) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Extra: I think the above is the best way, but I suppose it also works to “transpose” the layout:

cbind(predictions(election.1, new)) %>% 
  select(group, estimate, pct_latin) %>% 
  pivot_wider(names_from = pct_latin, values_from = estimate)

Now the column names are values of pct_latin, and the rows are candidates.

  7. (2 points) How does the probability of a vote for each candidate depend on the proportion of Latin-American voters in that precinct? Explain briefly.

When the proportion of Latin-American voters is small, voters are likely to vote for Garcia and unlikely to vote for either of the other two candidates. But when it is large, voters are about equally likely to vote for Garcia and Martinez.

Or, talk about how things change as pct_latin changes: as the proportion of Latin-American voters increases, the probability of voting for Garcia decreases, the probability of voting for Martinez increases, and the probability of voting for Yanez remains near constant (and low).

Extra: in this kind of model, a significant effect shows up most clearly in the predictions. If there had not been a pct_latin effect, the probability of a vote for each candidate would barely have changed as pct_latin changed.

  8. (2 points) Obtain a graph of the predictions. (You do not need to include the data; just the predictions is fine.)

This is plot_predictions. Bear in mind that the three things we kept from cbind(predictions) will need to go on the graph somewhere:

  • the group will be colour
  • the estimate will go on the \(y\)-axis (which we don’t have to say explicitly)
  • the pct_latin will go on the \(x\)-axis.

Hence:

plot_predictions(election.1, condition = c("pct_latin", "group"))

This shows the same thing as your predictions did: as pct_latin increases, the probability of a vote for Garcia decreases, of a vote for Martinez increases, and of a vote for Yanez is small and near constant (or slightly increasing).

Weather in Halifax, Nova Scotia

Halifax, Nova Scotia, is close to the Atlantic Ocean, and so it gets a lot of snow. Some daily snowfall data for the city is in http://datafiles.ritsokiguess.site/halifax-snow.csv. The snowfall amounts are in centimetres.

  1. (1 point) Read in and display (a little of) the data.

As always:

my_url <- "http://datafiles.ritsokiguess.site/halifax-snow.csv"
halifax <- read_csv(my_url)
Rows: 2753 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): date
dbl (1): snowfall

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
halifax

There are over 2700 rows, one per day, starting in 1967 and ending on

halifax %>% slice(2753)

As a quick mental calculation, the 365 days in a year is about 400 minus 10 percent, and we have daily data for about 7 1/2 years, so we should have about \(7.5 \times 400 = 3000\) minus 10 percent, that is, \(3000 - 300 = 2700\) rows. This is pretty much what we have.

Extra: these data came from Environment Canada. If you want to play with this yourself (there are a lot more cities and a lot of things other than just snowfall), point your web browser at https://climate-change.canada.ca/climate-data/#/daily-climate-data. Find the city and weather station you’re interested in, either by typing its name in the Search box or by zooming in on the map (which limits the names that appear), leave Download Format as CSV, then click on Retrieve Download Links. Below that will appear one or more links (if the weather record spans a lot of days, it will be split up). Click a link to download a CSV file, which will be on your computer wherever downloads go. I copied it from there to the folder I was working in, and:

my_url <- "climate-daily.csv"
halifax1 <- read_csv(my_url)
Rows: 2753 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): STATION_NAME, ID, PROVINCE_CODE
dbl  (15): x, y, CLIMATE_IDENTIFIER, LOCAL_YEAR, LOCAL_MONTH, LOCAL_DAY, MEA...
lgl  (17): MEAN_TEMPERATURE_FLAG, MIN_TEMPERATURE_FLAG, MAX_TEMPERATURE_FLAG...
dttm  (1): LOCAL_DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
halifax1 

There are a ton of columns, most of which have names that are CAPITAL LETTERS. The column called LOCAL_DATE actually is a date, and the column called TOTAL_SNOW is that day’s snowfall. (There is also a column called SNOW_ON_GROUND which is the total of the snow that fell and the snow that fell previously but hasn’t melted yet.)

So, what I did for you was:

  • use janitor::clean_names to make the column names lowercase
  • give the snowfall column a better name than total_snow
  • turn the date back into text (it got read in as a date because it was YYYY-MM-DD ISO format) (more detail on this below)
  • save only the date-as-text and snowfall columns for you:

my_url <- "climate-daily.csv"
halifax1 <- read_csv(my_url)
Rows: 2753 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): STATION_NAME, ID, PROVINCE_CODE
dbl  (15): x, y, CLIMATE_IDENTIFIER, LOCAL_YEAR, LOCAL_MONTH, LOCAL_DAY, MEA...
lgl  (17): MEAN_TEMPERATURE_FLAG, MIN_TEMPERATURE_FLAG, MAX_TEMPERATURE_FLAG...
dttm  (1): LOCAL_DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
halifax1 %>% janitor::clean_names() %>% 
  select(local_date, snowfall = total_snow) %>% 
  mutate(date = strftime(local_date + hours(12), format = "%a %b %e, %Y")) %>% 
  select(date, snowfall) 

lubridate has several functions for turning text into dates, but not for the reverse. There is a standard way of doing that, which is called strftime. This accepts as input a date (or date-time) plus a “format string”, which says in code how to display the date (datetime). There are lots of codes (see the bottom of the strftime help, ?strftime in the console, for all of them), but the ones I used are:

  • %a: abbreviated name of day of the week
  • %b: abbreviated month name
  • %e: day of the month (with a blank on the front if it is less than 10)
  • %Y: the four-digit year

The reason for adding 12 hours is that a date without a time is considered to be midnight at the beginning of the date, and I was having trouble with the dates coming out a day too early, so I thought that if I made it noon in the middle of the day in question, there would be no doubt what date it was.
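
To illustrate those codes and the noon trick on a single date (one that will turn up later in the snowfall data; note that %a and %b produce English names only in an English locale):

```r
library(lubridate)
d <- ymd("1974-05-27")                 # a Date, which means midnight
strftime(d + hours(12), format = "%a %b %e, %Y", tz = "UTC")
# in an English locale: "Mon May 27, 1974"
```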

There is also (in base R) a function called strptime which turns a piece of text into a date, the same way that dmy and friends do, but you have to give it a format string that says what the various bits of text are, using the same codes as for strftime, so it is a good deal harder to use than the much more flexible and friendly functions in lubridate. Plus, it is not mentioned in the course at all (other than here), so there is no credit in this course for using it.
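
For the record, a sketch of strptime in action (parsing %b as "May" assumes an English locale):

```r
# text to date, telling strptime what each piece of the text is:
strptime("May 27, 1974", format = "%b %d, %Y", tz = "UTC")
```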

This idea of a format string comes from the C programming language (from where a lot of R’s non-statistical ideas came). It was first used to format numbers (e.g. to display them to a fixed number of decimal places) using codes beginning with a % to say how you wanted the number formatted, such as %f to display a “floating-point”, i.e. decimal, number. The idea lives on in Python:

pi = 3.1415926
"%6.2f" % pi
'  3.14'
"%6.4f" % pi
'3.1416'
"%12.5e" % 0.05
' 5.00000e-02'

The format code is in the quotes on the left. In between the % and the f are the total number of characters to use, and the number of decimal places to display. If the display doesn’t use that many characters, extra blanks are added on the left. For example, 3.14 is only 4 characters, not 6, so two blanks are added on the left to make the whole thing 6 characters. The bottom one uses the %e format code to force scientific notation; the whole thing is 12 characters wide, and there are 5 decimal places; read this as \(5 \times 10^{-2}\).

R also has sprintf, which was stolen wholesale from C:

sprintf("%6.2f", pi)
[1] "  3.14"
sprintf("%6.4f", pi)
[1] "3.1416"
sprintf("%12.5e", 0.05)
[1] " 5.00000e-02"

The format codes in strftime are a takeoff on this idea; the people behind strftime realized that you could use the same approach to format dates and times in the way you wanted.

  2. (3 points) In your dataframe, what kind of thing is your date column? Convert this column into R dates.

These are text (see the chr at the top of the column in the display of the dataframe).

To convert them into dates, figure out the order of the year, month, and day. They are month first, then day of the month, then year. So mdy is the function we need. (The days of the week will be ignored.)

halifax %>% 
  mutate(actual_date = mdy(date)) -> halifax
halifax

You might have a bit of trouble dreaming up a name for your new column. My first thought was actual_date_no_really, but I decided that this was too silly.

These dates look correct, and the days of the week have indeed been ignored. I saved the dataframe on the grounds that we are likely to need it again.

Extra: I think the days of the week can even be wrong. Let me experiment. I am writing this on Friday Jan 23, 2026:

mdy("Friday Jan 23, 2026")
[1] "2026-01-23"

which works, but what about

mdy("Thursday Jan 23, 2026")
[1] "2026-01-23"

which has the wrong day of the week, but happily reads in as the correct date.

  3. (3 points) Which four months had the most days with (non-zero) snowfall? Describe your process.

The first thing is to get hold of the month, which you can do as a name or a number. I’ll show how to do it as a month name:

halifax %>% 
  mutate(month_name = month(actual_date, label = TRUE))

If you leave off the label, you get month numbers, which is fine here as long as you translate back to month numbers when you give your final answer:

halifax %>% 
  mutate(month = month(actual_date))

Next, we want to only count days with a nonzero amount of snowfall:

halifax %>% 
  mutate(month_name = month(actual_date, label = TRUE)) %>% 
  filter(snowfall > 0)

and now we can count these:

halifax %>% 
  mutate(month_name = month(actual_date, label = TRUE)) %>% 
  filter(snowfall > 0) %>% 
  count(month_name)

Note that the month names are displayed in the correct order. This is because the month names are what’s called an “ordered factor” (note the ord at the top of the month name column).
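
You can check the ordered-factor claim directly; a small sketch (the level names are the English abbreviations in an English locale):

```r
library(lubridate)
m <- month(ymd("1974-05-27"), label = TRUE)
class(m)    # "ordered" "factor"
levels(m)   # the twelve months, in calendar order
```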

The last step is to find the four months with the largest number of days with snowfall, which is slice_max (or arrange followed by slicing off the top four):

halifax %>% 
  mutate(month_name = month(actual_date, label = TRUE)) %>% 
  filter(snowfall > 0) %>% 
  count(month_name) %>% 
  slice_max(n, n = 4)

January, February, March, and December, in that order. If you used the month numbers, make sure you name the months when you say which ones have the most days with snowfall.

There are undoubtedly other ways to get to this answer. If you have what looks like a sensible method that gives the same answer and you can talk about what you did, you should be good.

By “describe your process”, I mean to talk about what you did or what you were thinking as you did this, with enough detail that your reader can follow what you did and is convinced that what you did is correct. You can either do one step at a time and talk about it before moving on (as I did), or you can present the whole pipeline and talk about what each line of code does. You have some flexibility; for example, you can do the filter first, and then get the months (month names).

I don’t know whether it surprises you or not that there were more snowfall days in March than there were in December. I do remember, from my time living there, that winter in Nova Scotia seems to go on for ever.

  4. (3 points) What is the latest date in a winter that any snow fell? Describe how you found out.

The latest date in a year that snow fell is undoubtedly December 31, but in the Northern Hemisphere, a winter spans two years, starting in about November and continuing until April or so the next year. Let’s go back to the snowfall days by month:

halifax %>% 
  mutate(month_name = month(actual_date, label = TRUE)) %>% 
  filter(snowfall > 0) %>% 
  count(month_name)

The latest snowfall day is sometime in May (there were three snowfall days in May); after that is summer, and October is part of the next winter. So now that we know the answer is “sometime in May”, we can search for the three snowfall days in May and see which one is the latest:

halifax %>% 
  mutate(month_name = month(actual_date, label = TRUE)) %>% 
  filter(snowfall > 0) %>% 
  filter(month_name == "May")

In 1974, it actually snowed on May 27!¹

As a more automatic way to get the answer, note that July (or August) is in the middle of summer, so the answer must be before July. So we restrict ourselves to looking at days before July, on which snow fell, and then we take the latest of those. Since we are comparing months to July (7th month), it seems clearer to use month numbers for this, and we’ll need day numbers within the month as well to make the sorting work:

halifax %>% 
  mutate(month_number = month(actual_date)) %>% 
  mutate(day_number = mday(actual_date)) %>% 
  filter(snowfall > 0) %>% 
  filter(month_number <= 7) %>% 
  slice_max(tibble(month_number, day_number), n = 1)

May 27, which happened in 1974.

This requires a little care. The above is how you use slice_max on two columns, using the second one to break ties in the first (as in arrange). I found this in the help for slice_max; specifically, in one of the examples at the bottom.

If you don’t see this, arrange plus slice is perfectly good here:

halifax %>% 
  mutate(month_number = month(actual_date)) %>% 
  mutate(day_number = mday(actual_date)) %>% 
  filter(snowfall > 0) %>% 
  filter(month_number <= 7) %>% 
  arrange(desc(month_number), desc(day_number)) %>% 
  slice(1)

My process is:

  • obtain the month and day of each date as numbers
  • grab only the days where there was snowfall
  • out of those, grab only the dates in July or earlier (or some other suitable month)
  • find the latest month (of the days remaining), and, of those, the latest day within the month.

Extra: lubridate also has a function yday that returns the day of the year, 1 being Jan 1 and 365 being Dec 31, which saves some messing about:

halifax %>% 
  mutate(year_day = yday(actual_date)) %>% 
  filter(snowfall > 0) %>% 
  filter(year_day < 180) %>% 
  slice_max(year_day, n = 1)

I guessed that the last day of snow could not be as much as halfway through the year, hence restricting myself to year_day less than 180. This time, slice_max works smoothly (we want the row with the maximum value of only one variable). But if you go this way, you need to say where you found out about yday: a website, or the help page for day or wday which includes the help for yday.

Points: \((1+2+2+3+2+3+2+2) + (1+3+3+3) = 17 + 10 = 27\)

Footnotes

  1. Also, look at how much snow fell on May 9 and 10, 1972.