Worksheet 4
Packages
library(tidyverse)
library(marginaleffects)
library(nnet)
# library(MASS, exclude = "select")
The first problem is a repeat from Worksheet 3, in case you didn’t do it before. The other problems are on dates and times.
Palmer penguins
The penguins dataset in the package palmerpenguins contains body measurements and other information for 344 adult penguins that were observed at the Palmer Archipelago in Antarctica. Variables of interest to us are:
- species: the species of the penguin (one of Adelie, Chinstrap, Gentoo)
- bill_length_mm: bill length in mm
- bill_depth_mm: bill depth (from top to bottom) in mm
- flipper_length_mm: flipper length in mm
- body_mass_g: body mass in grams
- sex: whether the penguin is male or female
In particular, is it possible to predict the species of a penguin from the other variables?
- Load the package (installing it first if you need to) and display the dataset.
That means this:
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
penguins
- There are some missing values in the dataset. Remove all the observations that have missing values on any of the variables, and save the resulting dataset.
This is what drop_na does. You can either feed it some columns to check for missings in, or if you feed it no columns at all, it will check for missings everywhere. I saved it back into the original dataset, but you can certainly create a new one if you prefer:
penguins %>% drop_na() -> penguins
If you have another way that is consistent with the way we have learned things, that is also good. (Ideas from base R are not consistent with our approach.)
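For example, drop_na(sex) would check for missings only in sex. And as a sketch of one alternative that stays within the tidyverse, you can filter out any row in which if_any finds a missing value:
penguins %>% filter(!if_any(everything(), is.na)) # keep only rows with no missing values anywhere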
Extra: to convince myself that all the missing values are indeed gone:
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm
Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60
Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30
Mean :43.99 Mean :17.16
3rd Qu.:48.60 3rd Qu.:18.70
Max. :59.60 Max. :21.50
flipper_length_mm body_mass_g sex year
Min. :172 Min. :2700 female:165 Min. :2007
1st Qu.:190 1st Qu.:3550 male :168 1st Qu.:2007
Median :197 Median :4050 Median :2008
Mean :201 Mean :4207 Mean :2008
3rd Qu.:213 3rd Qu.:4775 3rd Qu.:2009
Max. :231 Max. :6300 Max. :2009
and indeed none of the columns has any missing values remaining. There are 333 penguins left,1 so we had to get rid of 11 of them.
- Why is polr from MASS not appropriate for fitting a model to predict species from the other variables?
polr assumes that the categorical response variable species has ordered categories, but in this case the species are just names or labels, and there is no (unambiguous) way to put them in order (such as from “low” to “high”).
Hence, looking just ahead, we will need to use multinom from the nnet package to fit our models. Make sure you have nnet installed.
- Fit a suitable model predicting species from the other variables (the fitting process will produce some output).
From the previous question, we have ruled out polr, so:
penguins.1 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex, data = penguins)
# weights: 21 (12 variable)
initial value 365.837892
iter 10 value 16.719027
iter 20 value 3.047829
iter 30 value 0.103484
iter 40 value 0.063985
iter 50 value 0.006201
iter 60 value 0.003924
iter 70 value 0.001733
iter 80 value 0.001547
iter 90 value 0.001513
iter 100 value 0.001498
final value 0.001498
stopped after 100 iterations
Extra: should you be curious:
summary(penguins.1)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm +
flipper_length_mm + body_mass_g + sex, data = penguins)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap -103.422229 16.51642 -24.41758 -0.31003929
Gentoo -6.756321 12.21924 -30.09757 -0.08758996
body_mass_g sexmale
Chinstrap -0.030200928 -15.820607
Gentoo 0.005019684 -8.132101
Std. Errors:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap 0.5715154 27.749774 27.43640 4.097365
Gentoo 0.5715734 8.968787 33.25608 11.905692
body_mass_g sexmale
Chinstrap 0.1350169 3.077521
Gentoo 0.5848602 3.498288
Residual Deviance: 0.002996934
AIC: 24.003
This doesn’t really tell us very much. The coefficients are kind of slopes (treating the first category Adelie as baseline), and there are some log-odds hiding in the background, but for knowing things of actual interest to us, like “can we remove anything from the model?”, this is not what to look at.
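For the record, what multinom fits is a baseline-category logit model: for each species \(j\) other than the baseline Adelie, \(\log\left( P(\text{species} = j) / P(\text{species} = \text{Adelie}) \right) = \beta_{0j} + \beta_{1j}\,\text{bill length} + \cdots\), which is why the coefficient table has one row each for Chinstrap and Gentoo.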
- Use step to see whether anything can be removed from your model. Which variables remain after step has finished? (Hints: save the output from step, because this is actually the best model that step found. Also, the fitting process will probably produce a lot of output, which, for this question, you can include in your assignment.)
First off, you are probably thinking drop1, but remember that this doesn’t work for these models (no, I don’t know why not either):
drop1(penguins.1, test = "Chisq")
trying - bill_length_mm
Error in if (trace) {: argument is not interpretable as logical
so step it is:
penguins.2 <- step(penguins.1)
Start: AIC=24
species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex
trying - bill_length_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 129.185715
iter 20 value 114.018556
iter 30 value 113.336460
iter 40 value 113.216108
final value 113.215886
converged
trying - bill_depth_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 66.463118
iter 20 value 12.514139
iter 30 value 7.743531
iter 40 value 5.662963
iter 50 value 4.350499
iter 60 value 4.145847
iter 70 value 4.093510
iter 80 value 3.984724
iter 90 value 3.903126
iter 100 value 3.831010
final value 3.831010
stopped after 100 iterations
trying - flipper_length_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 25.561758
iter 20 value 1.350538
iter 30 value 0.065803
iter 40 value 0.020499
iter 50 value 0.016911
iter 60 value 0.012712
iter 70 value 0.005348
iter 80 value 0.001290
iter 90 value 0.000225
iter 100 value 0.000216
final value 0.000216
stopped after 100 iterations
trying - body_mass_g
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 9.579038
iter 20 value 4.114389
iter 30 value 2.340467
iter 40 value 0.614487
iter 50 value 0.055642
iter 60 value 0.012303
iter 70 value 0.007327
iter 80 value 0.007163
iter 90 value 0.007015
iter 100 value 0.006995
final value 0.006995
stopped after 100 iterations
trying - sex
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 16.321214
iter 20 value 3.754897
iter 30 value 1.631859
iter 40 value 0.012427
iter 50 value 0.001125
iter 60 value 0.001108
iter 70 value 0.001006
iter 80 value 0.000906
iter 90 value 0.000498
iter 100 value 0.000498
final value 0.000498
stopped after 100 iterations
Df AIC
- flipper_length_mm 10 20.00043
- sex 10 20.00100
- body_mass_g 10 20.01399
<none> 12 24.00300
- bill_depth_mm 10 27.66202
- bill_length_mm 10 246.43177
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 25.561758
iter 20 value 1.350538
iter 30 value 0.065803
iter 40 value 0.020499
iter 50 value 0.016911
iter 60 value 0.012712
iter 70 value 0.005348
iter 80 value 0.001290
iter 90 value 0.000225
iter 100 value 0.000216
final value 0.000216
stopped after 100 iterations
Step: AIC=20
species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
trying - bill_length_mm
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 136.772749
iter 20 value 133.782638
iter 30 value 133.545483
final value 133.538635
converged
trying - bill_depth_mm
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 30.179445
iter 20 value 5.699489
iter 30 value 5.175455
iter 40 value 5.153196
iter 50 value 5.070255
iter 60 value 4.997789
iter 70 value 4.893410
iter 80 value 4.826209
iter 90 value 4.708542
iter 100 value 4.695168
final value 4.695168
stopped after 100 iterations
trying - body_mass_g
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 11.953428
iter 20 value 3.975600
iter 30 value 2.172365
iter 40 value 2.111553
iter 50 value 1.881400
iter 60 value 1.236767
iter 70 value 1.133364
iter 80 value 1.059253
iter 90 value 0.588576
iter 100 value 0.490499
final value 0.490499
stopped after 100 iterations
trying - sex
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 20.098944
iter 20 value 3.621850
iter 30 value 1.160548
iter 40 value 1.108738
iter 50 value 1.063387
iter 60 value 0.958674
iter 70 value 0.926099
iter 80 value 0.837719
iter 90 value 0.797644
iter 100 value 0.741759
final value 0.741759
stopped after 100 iterations
Df AIC
- body_mass_g 8 16.98100
- sex 8 17.48352
<none> 10 20.00043
- bill_depth_mm 8 25.39034
- bill_length_mm 8 283.07727
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 11.953428
iter 20 value 3.975600
iter 30 value 2.172365
iter 40 value 2.111553
iter 50 value 1.881400
iter 60 value 1.236767
iter 70 value 1.133364
iter 80 value 1.059253
iter 90 value 0.588576
iter 100 value 0.490499
final value 0.490499
stopped after 100 iterations
Step: AIC=16.98
species ~ bill_length_mm + bill_depth_mm + sex
trying - bill_length_mm
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 150.109767
iter 20 value 142.419807
iter 30 value 142.047323
final value 142.041293
converged
trying - bill_depth_mm
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 139.953898
iter 20 value 133.697479
iter 30 value 133.399000
iter 40 value 133.326376
final value 133.325373
converged
trying - sex
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 26.650783
iter 20 value 23.943597
iter 30 value 23.916873
iter 40 value 23.901339
iter 50 value 23.895442
iter 60 value 23.894251
final value 23.892065
converged
Df AIC
<none> 8 16.98100
- sex 6 59.78413
- bill_depth_mm 6 278.65075
- bill_length_mm 6 296.08259
This is where you get rather a lot of output. step has to check the effect of removing each explanatory variable from each model it fits, and multinom itself produces half a dozen or so lines of output each time it is run.
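Aside: if the iteration logs bother you, multinom has a trace argument (passed on to nnet) that you can set to FALSE, something like this:
penguins.1 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex, data = penguins, trace = FALSE) # trace = FALSE suppresses the iteration log
I believe step will then rerun each candidate fit quietly as well (the refits keep the trace = FALSE), leaving just step's own AIC tables.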
To see which variables are left, either read them off from the output here, or look at summary of your new model just enough to see which variables it has in it:
summary(penguins.2)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm +
sex, data = penguins)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm sexmale
Chinstrap -242.9740 14.88938 -21.91411 -36.9051125
Gentoo 121.0476 14.80645 -44.69711 -0.1301506
Std. Errors:
(Intercept) bill_length_mm bill_depth_mm sexmale
Chinstrap 3.985819 9.369182 23.16685 23.33386
Gentoo 4.071031 9.413971 23.17697 43.25185
Residual Deviance: 0.9809986
AIC: 16.981
Whichever way you slice it, bill length, bill depth, and sex remain; flipper length and body mass have been removed.
For the predictions below, use this model.
- Obtain predicted probabilities of each species from your best model for all nine combinations of: bill length 39.6, 40.0, and 40.4, bill depth 15.8, 16.0, and 16.2, sex female (only). Display those predicted probabilities in a way that allows for easy comparison of the three species probabilities for given values of the other variables.
Three stages, as we have seen before (a point each):
- make a dataframe (perhaps called new) of all the combinations of explanatory variables
- do the predictions
- rearrange the predictions so that they are easy to compare.
For the first step: datagrid does all possible combinations, so that is the way to go. When there are a lot of things that you want combinations of, it can be clearer to define them first. The best model is the one output by step, which I called penguins.2:
lengths <- c(39.6, 40.0, 40.4)
depths <- c(15.8, 16.0, 16.2)
sexes <- "female"
new <- datagrid(model = penguins.2, bill_length_mm = lengths, bill_depth_mm = depths, sex = sexes)
new
There are \(3 \times 3 \times 1 = 3^2 = 9\) combinations of values. My habit, when I define the values first like this, is to use a plural name for the values, so that when I construct new, it’s “singular = plural” all the way through.
If you prefer to put the actual values inside the datagrid, you can do that, but then you will need to split the line of code so that it is easier to see all the values, for example a new line for each variable, like this:
new0 <- datagrid(model = penguins.2,
bill_length_mm = c(39.6, 40.0, 40.4),
bill_depth_mm = c(15.8, 16.0, 16.2),
sex = "female")
new0
Doing it this way, with each variable on a new line, makes it easier to read.
Second, the predictions. Save the explanatory variable names, the column group that says which species the predicted probability is for, and the column estimate of the predicted probabilities themselves:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex)
Third, pivot-wider the estimated probabilities into columns named for the species. Mine came out wider than my display, so I additionally rounded off the probabilities to make the columns narrower:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex) %>%
pivot_wider(names_from = group, values_from = estimate) %>%
mutate(across(Adelie:Gentoo, \(x) round(x, 3))) # for each column from Adelie to Gentoo, round it to 3 decimals
A second way to do this is to round off the predicted probabilities before you pivot-wider, which is easier to do but harder to see that you can do it this way. Run the code as far as the select to figure out how you need to do it: there’s a column called estimate with all the predicted probabilities in it, so round this column to 3 decimals before you proceed with the pivot_wider:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex) %>%
mutate(estimate = round(estimate, 3)) %>%
pivot_wider(names_from = group, values_from = estimate)
Extra: I had to play around a bit with the numbers I wanted you to get predictions for. The problem is that the species are actually very distinct (as you will see with your graph), and so if you are not careful with your values, you will get a lot of predicted probabilities that are close to zero and one. I ended up restricting things to just females for this question because (as it turns out) males have a similar pattern but are just bigger overall, and the pattern was easier to see with just one sex.
- Using your predictions, what combinations of values for bill length and bill depth distinguish the three species?
Find the largest predicted probabilities for each species and say something about whether they go with small or large values on each variable:
- Adelie have small bill length and large bill depth
- Chinstrap have large bill length and medium or large bill depth
- Gentoo have small bill depth (but can have any bill length).
Another way you can go is to look at the combinations of large and small values on each explanatory variable:
- large bill length and large bill depth is probably Chinstrap
- large bill length and small bill depth is probably Gentoo
- small bill length and large bill depth is (almost certainly) Adelie
- small bill length and small bill depth is probably Gentoo (there actually aren’t really any of these, as we’ll see from a graph shortly).
- Make a plot that includes bill length, bill depth, species, and sex. Does your plot support your conclusions from the previous question?
This plot is one you would make from the data, not of the predictions from the model (which is rather hard to interpret).
This has two quantitative variables (the first two) and two categorical (the last two). If you only had one categorical variable, this would be a scatterplot with the levels of the categorical variable as colour. The procedure we have seen for dealing with “too many” categorical variables is to use facets. My take is that sex is explanatory, so that is better as facets, and hence:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
geom_point() + facet_wrap(~ sex)
I am deliberately not using scales = "free" because I want to be able to compare where the points are on the two facets.
I have to say that I think these graphs are actually the best way towards understanding here. The benefit of fitting the model was to show that these variables are the ones that will help to distinguish the species.
The bill length and bill depth do a great job of distinguishing the penguin species:
- small bill length (and large bill depth) is Adelie
- large bill length and large bill depth is Chinstrap
- large bill length and small bill depth is Gentoo
- this pattern is the same for males and females, but the males are bigger on both measurements than the females. (This last is the reason for having the same scales in both facets, and was also the reason for having you look at just females previously.)
For you, say something sensible about how the graph (for females) is consistent or not with what you found in the previous question.
When I looked at the predicted probabilities, I got the same thing as this except that there, I predicted Gentoo for small bill depth and all bill lengths. (What actually happened is that there really weren’t any small bill lengths and small bill depths in the data, but the prediction had to come up with something, and the values we were predicting for were slightly closer to Gentoo than Adelie.)
Extra: I actually drew the graph first, and then tried to reverse-engineer a sensible prediction question for you.
plot_predictions doesn’t work so easily here because there are two explanatory variables in the model. To show both, because you want to keep colour for the species, you put the other ones on the end to make facets. This chooses five representative values for a quantitative explanatory variable and gives you a facet for each. I also added sex; plot_predictions will use facet_grid rather than facet_wrap when there are two explanatory variables after the thing that is colour. The facets for the first one are in columns and for the second one are in rows. The five different values for bill_depth_mm are hidden in the facet labels, and you can’t really see them, but take it that the plot is using something like a five-number summary:
plot_predictions(penguins.2, condition = c("bill_length_mm",
"group", "sex",
"bill_depth_mm"))Warning: Removed 398 rows containing missing values or values outside the scale range
(`geom_ribbon()`).
The species are actually very distinguishable (as you saw from the plot you made for this part), so most of the predicted probabilities are close to zero or one.2 Reading these graphs down the left side (for females):
- when bill depth is small, it is a Gentoo
- when bill depth is small-to-moderate:
- when bill length is small, it is an Adelie
- when bill length is large, it is a Gentoo
- when bill depth is large:
- when bill length is small, it is an Adelie
- when bill length is large, it is a Chinstrap.
The actual definitions of “small” and “large” for bill depth depend on the value of bill length.
This is more or less the same as our graph (the one I had you make), and the pattern is almost exactly the same for males and females, except that for females, the third facet down is the same as the last two, and for males, it is the same as the second one.3
Manchester United
Manchester United is one of the most famous soccer clubs in England and indeed the world. Information about the players (at some point in the past) is here: http://datafiles.ritsokiguess.site/manu.csv (it’s a CSV file). We are going to learn something about the ages of the players.
- Read in the file and display some of the resulting data frame.
The usual thing:
my_url <- "http://datafiles.ritsokiguess.site/manu.csv"
man_united <- read_csv(my_url)
Rows: 29 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): name, nationality, date_of_birth, country_of_birth, place_of_birth,...
dbl (2): number, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
man_united
- What kind of thing is the column date_of_birth? Create a new column that contains the players’ dates of birth as actual R dates, and display the old dates of birth alongside your new column (or at least the first few rows of them). Save your updated data frame.
The dates of birth are currently text, as evidenced by the chr at the top of the column, or in the read_csv output.
To make these into proper R dates, we need to use something from lubridate. To figure out what, look at the dates as given: they are a day number, a month name and then a year, so that dmy will do the job. (It doesn’t matter that the month is a name; R has a thing called a locale that tells it what language it is working in, and it will get things like month names from there.)
Thus:
man_united %>%
mutate(date_of_birth_date = dmy(date_of_birth)) -> mu2
mu2 %>% select(starts_with("date"))
You have a lot of freedom here in terms of choices for things. I thought it was easier to save the new data frame first, then look at the appropriate columns, but if you have something else that works, it’s good.
The new column is correctly a date, and is displayed in ISO format with the year first, but you can check that, at least for the ones you see, the actual dates in both formats are the same.
Extra: the dates of birth are read in as text because they are not in the format year-month-day where everything is numbers, which is the only format that will get turned into dates automatically.
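A tiny demonstration, as a sketch (readr’s I() treats a literal string as a file; the column names iso and words are made up):
read_csv(I("iso,words\n2016-05-21,21 May 2016")) # iso is parsed as a date, words stays as text
The first column comes out as an actual date; the second stays text until you convert it yourself with something like dmy.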
- Treating the new column of dates as quantitative, make a suitable plot of these with the players’ position on the field. What do you see on the quantitative axis?
The positions are text, and if you look at the data, they are one of these:
mu2 %>% count(position)
So it makes sense to treat position as categorical, and hence to draw a boxplot. You might be curious about what will happen:
ggplot(mu2, aes(x = position, y = date_of_birth_date)) + geom_boxplot()
The \(y\)-axis is marked off in years, that is, on a date scale. (If the players had all been born in the same year, you would have seen months here as well.)
- Is there a position where the players tend to be older? Explain briefly.
The oldest players will have had the earliest birth dates, that is, they will be at the bottom of the plot. Thus, the goalkeepers tend to be older than the players in other positions.
If you know anything about soccer, you’ll know that it is generally good to have an experienced goalkeeper, someone who knows how to be in the right place so that they don’t have to move very far (or who knows what an opposing attacker is likely to do).
In this data set, the attackers are youngest, perhaps because you want someone fast playing up front.
- (This is to prepare you for the next thing.) Work out how many years old you are by using something like as.Date("2016-05-21") to turn your birth date into an R date, create a period from the interval from it to now (use today() to get today’s date), and pull out the number of (completed) years. (If you don’t want to share your birth date, use any other date. I’m not checking.) Make sure to have the right number of brackets in the right places.
Here’s the one-step solution, using the date above (which is the last-but-one date on which Manchester United won the FA Cup; the last time was in 2024):
year(as.period(as.Date("2016-05-21") %--% Sys.Date()))
[1] 9
If you don’t like that, build it up in steps, putting brackets around the appropriate things, and copying and pasting each line into the next one:
as.Date("2016-05-21")[1] "2016-05-21"
today()[1] "2026-01-22"
as.period(as.Date("2016-05-21") %--% today())[1] "9y 8m 1d 0H 0M 0S"
year(as.period(as.Date("2016-05-21") %--% today()))[1] 9
These are, respectively:
- the date of the cup final win, as a date
- “today” (when I last ran this) as a date
- the time between then and now as a period (note how it displays)
- the number of years since then. This was (at the time of writing) 9 years and some number of months ago.
I realize that this also works as a pipe:
as.Date("2016-05-21") %--% today() %>%
as.period() %>%
year()
[1] 9
but it’s better for what we’ll be doing (in a moment) to cope with all the brackets.
Extra: any other way of turning your birthday from text into a date is also fine, such as
mdy("May 21, 2016")[1] "2016-05-21"
in place of the as.Date, but the focus of this question was making a period and extracting the number of years from it, so it is not important how you turn your birthday into a date.
- Go back to the Manchester United players. Calculate a new column containing the age, in completed years, of each player as measured today (thus, for example, a player who is currently 29 years and some number of days old should be counted as 29 years old, even if the number of days is something like 364). Display your new column side by side with the one called age. (This uses the same technique that you used to calculate your own age in the previous part, except that you don’t need anything like as.Date because you converted the birth dates into R Dates in an earlier part.)
The reason for doing the previous part was to give you hints for this one.
I am just going to display my result, since I am going to add to my pipeline in a moment:
mu2 %>% mutate(
my_age = year(as.period(date_of_birth_date %--% today()))) %>%
select(age, my_age)
The calculation of what I called my_age is rather lengthy; you might need to split the line so that you see it all, or you could even define a function to do the heavy lifting so that the mutate is shorter (sketched below).
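For example, a minimal sketch of such a helper (the name age_in_years is my own):
age_in_years <- function(birth_date) {
  # completed years from birth_date up to today
  year(as.period(birth_date %--% today()))
}
mu2 %>% mutate(my_age = age_in_years(date_of_birth_date)) %>%
  select(age, my_age)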
- What does the previous question tell you about when I originally downloaded the data? (It is reasonable to assume that the ages in the original data were correct on the date I downloaded the data.)
The players’ ages as calculated by me and the players’ ages as given in the original data differ by about 6 years, by eyeballing the answer to the previous question, but we can be a bit more careful:
mu2 %>% mutate(
my_age = year(as.period(date_of_birth_date %--% today()))) %>%
select(age, my_age) %>%
mutate(age_difference = my_age - age) %>%
count(age_difference)
so the short answer is “between 6 and 7 years ago”. (Three of the players are seven years older today than they were when the data were collected.)
Extra: we can get a more precise answer about the original download date by guessing a download date, using that instead of today in the code above, and seeing how many players have the “wrong” age based on the guessed download date. I might need to make several guesses, so first I am going to write a function to display the players with the “wrong” ages and their birthdates:
find_wrong_ages <- function(guessed_download_date, mu_df) {
mu_df %>%
mutate(my_age = year(as.period(date_of_birth_date %--% guessed_download_date))) %>%
select(name, date_of_birth, age, my_age) %>%
filter(age != my_age)
}
Let’s try this with today (so that every player should have the wrong age):
find_wrong_ages(today(), mu2)
That seems to have worked. So now let’s try a date about seven years ago:
find_wrong_ages(ymd("2019-01-22"), mu2)Most of the players (26 out of 29) have a calculated age that is a year too young, so the actual download date must be later than the one we tried. Let’s try October:
find_wrong_ages(ymd("2019-10-22"), mu2)And now we’re getting closer. Note now these birthdays are between October 31 and December 8, so if we go to a date after McTominay’s birthday, we should be right:
find_wrong_ages(ymd("2019-12-09"), mu2)There are no wrong ages now, so the ages are consistent with the download having been on Dec 9. But it might actually have been a little later than that:
find_wrong_ages(ymd("2020-01-01"), mu2)It could, in fact, have been anywhere between Dec 9 and Dec 14, to make sure that Lingard’s age is correct:
find_wrong_ages(ymd("2019-12-14"), mu2)I don’t remember the exact date, but I had either just finished or was just about to start marking the C32 final exam that year, so I was putting some problems together for D29, including this one.
Watching the NBA
The NBA (National Basketball Association) runs North America’s major basketball league, whose games are played from October to April. The 2023-2024 schedule is at http://ritsokiguess.site/datafiles/nba_sched.csv. (This question came from an assignment where this was the current season.)
- Read in and display some of the data.
As you well know by now:
my_url <- "http://ritsokiguess.site/datafiles/nba_sched.csv"
nba <- read_csv(my_url)
Rows: 1200 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): date, road, home, start_eastern, TV
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba
Extra: there was originally a long story here about how I got the data into this format for you (it was a PDF file), but the original link has disappeared, so the long story has too.
- NBA games are played on different days of the week. Which day of the week has the most games, and which day of the week has the fewest? Use the tools you saw in lecture, in this course and STAC32, to work this out. (Hint: does it seem to matter that the year only has two digits here?)
Before you start coding, do some thinking:
- to get the days of the week, you can use wday.
- to use wday, you need actual dates, so you have to make those first. These are American-style dates, with the month first, so we will need mdy for this.
- once you have worked out the day of week of each game, you can use count to count them up, and arrange to put them in order.
That leads to something like this:
nba %>%
mutate(date1 = mdy(date)) %>%
mutate(dow = wday(date1, label = TRUE)) %>%
count(dow) %>%
arrange(desc(n))
Wednesday is the most common day for a game, and Thursday is the least common.
To pick up on the hint: let’s run the first mutate again:
nba %>%
mutate(date1 = mdy(date)) %>%
select(date, date1)
mdy inferred that the dates were in the 21st century, so that the two-digit years were actually all right. (It would be smart to check this before you get too far into the question.)
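One such check, as a sketch: the earliest and latest parsed dates should both land inside the 2023–2024 season:
nba %>%
  mutate(date1 = mdy(date)) %>%
  summarize(earliest = min(date1), latest = max(date1)) # both should fall in 2023-2024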
Extra: I was actually surprised by the Wednesday thing; I was expecting a weekend day to be most common. I used to follow the NHL (back in the days of Wayne Gretzky as a player, which shows you how old I am). As I remember, there were most games on a Saturday, and fewest on a Monday, and that was the sort of thing I was expecting to see here. My guess is that a lot of people who follow the NBA also follow college basketball, many of whose games are played on Saturdays.
- You have a friend who lives in Auckland, New Zealand, who is a big basketball fan. They have a streaming package that enables them to watch any NBA game live. They get home from work at 4:00pm (local time) every day. What are some games they would have been able to watch from start to finish as they happen? Use tools we have seen in lecture to find this out. Hints below:
- your friend needs to know about games that start at 4:00pm or later Auckland time (16:00 or later).
- you can use unite to glue a date and time together as text
- if your time does not have seconds, omit the s in the appropriate function
- when you create a date-time that needs to be in a certain timezone, add the timezone when you create it. America/Toronto will do for Eastern time; use OlsonNames() to get a list of all the time zone names that R knows about. (The output from OlsonNames() is long, so just find what you need and use that.)
Again, thinking first:
- we are talking time zones, so we need actual date-times with timezones.
- the unite hint above says how to make a date-time as text
- use something like ymd_hms to make an actual date-time. Here, the months are first and the years last, and there are no seconds, so use mdy_hm instead. (All these permutations exist.)
- add a timezone when you are creating the date-times (Eastern time), otherwise you’ll get the UTC timezone.
- find out using OlsonNames() that the time zone that Auckland, NZ is in is called Pacific/Auckland (down near the bottom; there is also a timezone called simply NZ that should also work. New Zealand as a whole has only one timezone.)
- use with_tz to move a date-time from one timezone to another.
I know the start of a basketball game as “tip” or “tip-off”, hence my names for the columns below. Something like this is what you need:
nba %>% unite(tip_text, date, start_eastern) %>%
mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>%
mutate(tip_nz = with_tz(tip, "Pacific/Auckland"))If your friend has the patience, they can scroll down through 1200 rows and find the ones that start at or after 16:00 in the tip_nz column. But we can do better than that: pull out the hour from tip_nz, and grab the ones that are 16 or more:
nba %>% unite(tip_text, date, start_eastern) %>%
mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>%
mutate(tip_nz = with_tz(tip, "Pacific/Auckland")) %>%
mutate(tip_nz_hour = hour(tip_nz)) %>%
filter(tip_nz_hour >= 16)
Thus, for example, Golden State at Denver and Portland at Sacramento on Nov 8, LA Lakers at Phoenix and Oklahoma City at Sacramento on Nov 10, and so on. Say something like this. However you find them is good, but the way I just did it is best. There are 132 games your friend can watch after they get home from work, some of which are at the same time (so they will have to choose which one to watch live).
Extra 1: As you page through the games, you will see that they are all on the west coast (Pacific time zone). This is because the time difference between North America and New Zealand is such that all the other, earlier, games are on while your friend is at work. (You might argue that a game that starts at 6:00am Auckland time is also one your friend might be able to watch, if they get up early enough, because it would be before they go to work.)
Extra 2: If you look at the columns I called tip and tip_nz (in the dataframe with all 1200 games), you can see that at the start, they are 17 hours apart and all of the games are too early for your friend to watch (even the ones that start at 10:00pm Eastern). Once you get into November, the time difference becomes 18 hours, because daylight savings ends in November in North America. Near the end of the dataframe, on March 10, the difference again becomes 17 hours, because daylight saving time starts again in North America, but on April 6, it changes again to 16 hours. Why? Because New Zealand’s daylight saving time ends (it is at this point fall in New Zealand).4
What that means is that the NBA games start late enough for your Auckland friend to watch only when there is an 18 hour time difference, because only then do the 10:00pm Eastern (7:00pm Pacific) games match up to 4:00pm the next day in Auckland. When the time difference is less than that, the 10:00pm Eastern games start earlier than 4:00pm in Auckland.
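If you want to check those time differences without scrolling, here is a sketch that counts the games at each whole-hour offset. Subtracting clock hours mod 24 works here because both time zones sit a whole number of hours from UTC, so the minutes cancel:
nba %>% unite(tip_text, date, start_eastern) %>%
  mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>%
  mutate(tip_nz = with_tz(tip, "Pacific/Auckland")) %>%
  mutate(hours_ahead = (hour(tip_nz) - hour(tip)) %% 24) %>% # hours Auckland is ahead
  count(hours_ahead)
You should see games at the 16, 17, and 18 hour differences described above.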
I don’t know about you, but I’m quite glad to have with_tz handle all of that for us, and get an answer for your friend that actually would have worked.
Footnotes
1. For example, add up the males and females, or add up the numbers of each species.
2. This is why I had to work so hard to have you find some predictions that were not close to 0 or 1.
3. Look carefully to see which of those two close-together traces, because of the scale, is green and which one is blue.
4. The other end of NZ’s daylight saving time is in September, before the NBA season starts. That is when it starts, because at that point it is spring in NZ.