Worksheet 4
Packages
library(tidyverse)
library(marginaleffects)
library(nnet)
# library(MASS, exclude = "select")
The first problem is a repeat from Worksheet 3, in case you didn’t do it before. The other problems are on dates and times.
Palmer penguins
The penguins dataset in the package palmerpenguins contains body measurements and other information for 344 adult penguins that were observed at the Palmer Archipelago in Antarctica. Variables of interest to us are:
- species: the species of the penguin (one of Adelie, Chinstrap, Gentoo)
- bill_length_mm: bill length in mm
- bill_depth_mm: bill depth (from top to bottom) in mm
- flipper_length_mm: flipper length in mm
- body_mass_g: body mass in grams
- sex: whether the penguin is male or female
In particular, is it possible to predict the species of a penguin from the other variables?
- Load the package (installing it first if you need to) and display the dataset.
That means this:
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
penguins
- There are some missing values in the dataset. Remove all the observations that have missing values on any of the variables, and save the resulting dataset.
This is what drop_na does. You can either feed it some columns to check for missings in, or if you feed it no columns at all, it will check for missings everywhere. I saved it back into the original dataset, but you can certainly create a new one if you prefer:
penguins %>% drop_na() -> penguins
If you have another way that is consistent with the way we have learned things, that is also good. (Ideas from base R are not consistent with our approach.)
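For example, drop_na(sex) would check for missings only in sex. And as a sketch of one alternative that stays within the tidyverse, you can filter out any row in which if_any finds a missing value:
penguins %>% filter(!if_any(everything(), is.na)) # keep only rows with no missing values anywhere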
Extra: to convince myself that all the missing values are indeed gone:
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm
Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60
Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30
Mean :43.99 Mean :17.16
3rd Qu.:48.60 3rd Qu.:18.70
Max. :59.60 Max. :21.50
flipper_length_mm body_mass_g sex year
Min. :172 Min. :2700 female:165 Min. :2007
1st Qu.:190 1st Qu.:3550 male :168 1st Qu.:2007
Median :197 Median :4050 Median :2008
Mean :201 Mean :4207 Mean :2008
3rd Qu.:213 3rd Qu.:4775 3rd Qu.:2009
Max. :231 Max. :6300 Max. :2009
and indeed none of the columns has any missing values remaining. There are 333 penguins left,1 so we had to get rid of 11 of them.
- Why is polr from MASS not appropriate for fitting a model to predict species from the other variables?
polr assumes that the categorical response variable species has ordered categories, but in this case the species are just names or labels, and there is no (unambiguous) way to put them in order (such as from “low” to “high”).
Hence, looking just ahead, we will need to use multinom from the nnet package to fit our models. Make sure you have nnet installed.
- Fit a suitable model predicting species from the other variables (the fitting process will produce some output).
From the previous question, we have ruled out polr, so:
penguins.1 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex, data = penguins)
# weights: 21 (12 variable)
initial value 365.837892
iter 10 value 16.719027
iter 20 value 3.047829
iter 30 value 0.103484
iter 40 value 0.063985
iter 50 value 0.006201
iter 60 value 0.003924
iter 70 value 0.001733
iter 80 value 0.001547
iter 90 value 0.001513
iter 100 value 0.001498
final value 0.001498
stopped after 100 iterations
Extra: should you be curious:
summary(penguins.1)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm +
flipper_length_mm + body_mass_g + sex, data = penguins)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap -103.422229 16.51642 -24.41758 -0.31003929
Gentoo -6.756321 12.21924 -30.09757 -0.08758996
body_mass_g sexmale
Chinstrap -0.030200928 -15.820607
Gentoo 0.005019684 -8.132101
Std. Errors:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap 0.5715154 27.749774 27.43640 4.097365
Gentoo 0.5715734 8.968787 33.25608 11.905692
body_mass_g sexmale
Chinstrap 0.1350169 3.077521
Gentoo 0.5848602 3.498288
Residual Deviance: 0.002996934
AIC: 24.003
This doesn’t really tell us very much. The coefficients are kind of slopes (treating the first category Adelie as baseline), and there are some log-odds hiding in the background, but for knowing things of actual interest to us, like “can we remove anything from the model?”, this is not what to look at.
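For the record, what multinom fits is a baseline-category logit model: for each species \(j\) other than the baseline Adelie, \(\log\left( P(\text{species} = j) / P(\text{species} = \text{Adelie}) \right) = \beta_{0j} + \beta_{1j}\,\text{bill length} + \cdots\), which is why the coefficient table has one row each for Chinstrap and Gentoo.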
- Use step to see whether anything can be removed from your model. Which variables remain after step has finished? (Hints: save the output from step, because this is actually the best model that step found. Also, the fitting process will probably produce a lot of output, which, for this question, you can include in your assignment.)
First off, you are probably thinking drop1, but remember that this doesn’t work for these models (no, I don’t know why not either):
drop1(penguins.1, test = "Chisq")
trying - bill_length_mm
Error in if (trace) {: argument is not interpretable as logical
so step it is:
penguins.2 <- step(penguins.1)
Start: AIC=24
species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex
trying - bill_length_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 129.185715
iter 20 value 114.018556
iter 30 value 113.336460
iter 40 value 113.216108
final value 113.215886
converged
trying - bill_depth_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 66.463118
iter 20 value 12.514139
iter 30 value 7.743531
iter 40 value 5.662963
iter 50 value 4.350499
iter 60 value 4.145847
iter 70 value 4.093510
iter 80 value 3.984724
iter 90 value 3.903126
iter 100 value 3.831010
final value 3.831010
stopped after 100 iterations
trying - flipper_length_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 25.561758
iter 20 value 1.350538
iter 30 value 0.065803
iter 40 value 0.020499
iter 50 value 0.016911
iter 60 value 0.012712
iter 70 value 0.005348
iter 80 value 0.001290
iter 90 value 0.000225
iter 100 value 0.000216
final value 0.000216
stopped after 100 iterations
trying - body_mass_g
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 9.579038
iter 20 value 4.114389
iter 30 value 2.340467
iter 40 value 0.614487
iter 50 value 0.055642
iter 60 value 0.012303
iter 70 value 0.007327
iter 80 value 0.007163
iter 90 value 0.007015
iter 100 value 0.006995
final value 0.006995
stopped after 100 iterations
trying - sex
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 16.321214
iter 20 value 3.754897
iter 30 value 1.631859
iter 40 value 0.012427
iter 50 value 0.001125
iter 60 value 0.001108
iter 70 value 0.001006
iter 80 value 0.000906
iter 90 value 0.000498
iter 100 value 0.000498
final value 0.000498
stopped after 100 iterations
Df AIC
- flipper_length_mm 10 20.00043
- sex 10 20.00100
- body_mass_g 10 20.01399
<none> 12 24.00300
- bill_depth_mm 10 27.66202
- bill_length_mm 10 246.43177
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 25.561758
iter 20 value 1.350538
iter 30 value 0.065803
iter 40 value 0.020499
iter 50 value 0.016911
iter 60 value 0.012712
iter 70 value 0.005348
iter 80 value 0.001290
iter 90 value 0.000225
iter 100 value 0.000216
final value 0.000216
stopped after 100 iterations
Step: AIC=20
species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
trying - bill_length_mm
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 136.772749
iter 20 value 133.782638
iter 30 value 133.545483
final value 133.538635
converged
trying - bill_depth_mm
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 30.179445
iter 20 value 5.699489
iter 30 value 5.175455
iter 40 value 5.153196
iter 50 value 5.070255
iter 60 value 4.997789
iter 70 value 4.893410
iter 80 value 4.826209
iter 90 value 4.708542
iter 100 value 4.695168
final value 4.695168
stopped after 100 iterations
trying - body_mass_g
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 11.953428
iter 20 value 3.975600
iter 30 value 2.172365
iter 40 value 2.111553
iter 50 value 1.881400
iter 60 value 1.236767
iter 70 value 1.133364
iter 80 value 1.059253
iter 90 value 0.588576
iter 100 value 0.490499
final value 0.490499
stopped after 100 iterations
trying - sex
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 20.098944
iter 20 value 3.621850
iter 30 value 1.160548
iter 40 value 1.108738
iter 50 value 1.063387
iter 60 value 0.958674
iter 70 value 0.926099
iter 80 value 0.837719
iter 90 value 0.797644
iter 100 value 0.741759
final value 0.741759
stopped after 100 iterations
Df AIC
- body_mass_g 8 16.98100
- sex 8 17.48352
<none> 10 20.00043
- bill_depth_mm 8 25.39034
- bill_length_mm 8 283.07727
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 11.953428
iter 20 value 3.975600
iter 30 value 2.172365
iter 40 value 2.111553
iter 50 value 1.881400
iter 60 value 1.236767
iter 70 value 1.133364
iter 80 value 1.059253
iter 90 value 0.588576
iter 100 value 0.490499
final value 0.490499
stopped after 100 iterations
Step: AIC=16.98
species ~ bill_length_mm + bill_depth_mm + sex
trying - bill_length_mm
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 150.109767
iter 20 value 142.419807
iter 30 value 142.047323
final value 142.041293
converged
trying - bill_depth_mm
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 139.953898
iter 20 value 133.697479
iter 30 value 133.399000
iter 40 value 133.326376
final value 133.325373
converged
trying - sex
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 26.650783
iter 20 value 23.943597
iter 30 value 23.916873
iter 40 value 23.901339
iter 50 value 23.895442
iter 60 value 23.894251
final value 23.892065
converged
Df AIC
<none> 8 16.98100
- sex 6 59.78413
- bill_depth_mm 6 278.65075
- bill_length_mm 6 296.08259
This is where you get rather a lot of output. step has to check the effect of removing each explanatory variable from each model it fits, and multinom itself produces half a dozen or so lines of output each time it is run.
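Aside: if the iteration logs bother you, multinom has a trace argument (passed on to nnet) that you can set to FALSE, something like this:
penguins.1 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex, data = penguins, trace = FALSE) # trace = FALSE suppresses the iteration log
I believe step will then rerun each candidate fit quietly as well (the refits keep the trace = FALSE), leaving just step's own AIC tables.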
To see which variables are left, either read them off from the output here, or look at summary of your new model just enough to see which variables it has in it:
summary(penguins.2)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm +
sex, data = penguins)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm sexmale
Chinstrap -242.9740 14.88938 -21.91411 -36.9051125
Gentoo 121.0476 14.80645 -44.69711 -0.1301506
Std. Errors:
(Intercept) bill_length_mm bill_depth_mm sexmale
Chinstrap 3.985819 9.369182 23.16685 23.33386
Gentoo 4.071031 9.413971 23.17697 43.25185
Residual Deviance: 0.9809986
AIC: 16.981
Whichever way you slice it, bill length, bill depth, and sex remain; flipper length and body mass have been removed.
For the predictions below, use this model.
- Obtain predicted probabilities of each species from your best model for all nine combinations of: bill length 39.6, 40.0, and 40.4, bill depth 15.8, 16.0, and 16.2, sex female (only). Display those predicted probabilities in a way that allows for easy comparison of the three species probabilities for given values of the other variables.
Three stages, as we have seen before (a point each):
- make a dataframe (perhaps called new) of all the combinations of explanatory variables
- do the predictions
- rearrange the predictions so that they are easy to compare.
For the first step: datagrid does all possible combinations, so that is the way to go. When there are a lot of things that you want combinations of, it can be clearer to define them first. The best model is the one output by step, which I called penguins.2:
lengths <- c(39.6, 40.0, 40.4)
depths <- c(15.8, 16.0, 16.2)
sexes <- "female"
new <- datagrid(model = penguins.2, bill_length_mm = lengths, bill_depth_mm = depths, sex = sexes)
new
There are \(3 \times 3 \times 1 = 3^2 = 9\) combinations of values. My habit, when I define the values first like this, is to use a plural name for the values, so that when I construct new, it’s “singular = plural” all the way through.
If you prefer to put the actual values inside the datagrid, you can do that, but then you will need to split the line of code so that it is easier to see all the values, for example a new line for each variable, like this:
new0 <- datagrid(model = penguins.2,
bill_length_mm = c(39.6, 40.0, 40.4),
bill_depth_mm = c(15.8, 16.0, 16.2),
sex = "female")
new0
Doing it this way, with each variable on a new line, makes it easier to read.
Second, the predictions. Save the explanatory variable names, the column group that says which species the predicted probability is for, and the column estimate of the predicted probabilities themselves:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex)
Third, pivot-wider the estimated probabilities into columns named for the species. Mine came out wider than my display, so I additionally rounded off the probabilities to make the columns narrower:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex) %>%
pivot_wider(names_from = group, values_from = estimate) %>%
mutate(across(Adelie:Gentoo, \(x) round(x, 3))) # for each column from Adelie to Gentoo, round it to 3 decimals
A second way to do this is to round off the predicted probabilities before you pivot-wider, which is easier to do but harder to see that you can do it this way. Run the code as far as the select to figure out how you need to do it: there’s a column called estimate with all the predicted probabilities in it, so round this column to 3 decimals before you proceed with the pivot_wider:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex) %>%
mutate(estimate = round(estimate, 3)) %>%
pivot_wider(names_from = group, values_from = estimate)
Extra: I had to play around a bit with the numbers I wanted you to get predictions for. The problem is that the species are actually very distinct (as you will see with your graph), and so if you are not careful with your values, you will get a lot of predicted probabilities that are close to zero and one. I ended up restricting things to just females for this question because (as it turns out) males have a similar pattern but are just bigger overall, and the pattern was easier to see with just one sex.
- Using your predictions, what combinations of values for bill length and bill depth distinguish the three species?
Find the largest predicted probabilities for each species and say something about whether they go with small or large values on each variable:
- Adelie have small bill length and large bill depth
- Chinstrap have large bill length and medium or large bill depth
- Gentoo have small bill depth (but can have any bill length).
Another way you can go is to look at the combinations of large and small values on each explanatory variable:
- large bill length and large bill depth is probably Chinstrap
- large bill length and small bill depth is probably Gentoo
- small bill length and large bill depth is (almost certainly) Adelie
- small bill length and small bill depth is probably Gentoo (there actually aren’t really any of these, as we’ll see from a graph shortly).
- Make a plot that includes bill length, bill depth, species, and sex. Does your plot support your conclusions from the previous question?
This plot is one you would make from the data, not of the predictions from the model (which is rather hard to interpret).
This has two quantitative variables (the first two) and two categorical (the last two). If you only had one categorical variable, this would be a scatterplot with the levels of the categorical variable as colour. The procedure we have seen for dealing with “too many” categorical variables is to use facets. My take is that sex is explanatory, so that is better as facets, and hence:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
geom_point() + facet_wrap(~ sex)
I am deliberately not using scales = "free" because I want to be able to compare where the points are on the two facets.
I have to say that I think these graphs are actually the best way towards understanding here. The benefit of fitting the model was to show that these variables are the ones that will help to distinguish the species.
The bill length and bill depth do a great job of distinguishing the penguin species:
- small bill length (and large bill depth) is Adelie
- large bill length and large bill depth is Chinstrap
- large bill length and small bill depth is Gentoo
- this pattern is the same for males and females, but the males are bigger on both measurements than the females. (This last is the reason for having the same scales in both facets, and was also the reason for having you look at just females previously.)
For you, say something sensible about how the graph (for females) is consistent or not with what you found in the previous question.
When I looked at the predicted probabilities, I got the same thing as this except that there, I predicted Gentoo for small bill depth and all bill lengths. (What actually happened is that there really weren’t any small bill lengths and small bill depths in the data, but the prediction had to come up with something, and the values we were predicting for were slightly closer to Gentoo than Adelie.)
Extra: I actually drew the graph first, and then tried to reverse-engineer a sensible prediction question for you.
plot_predictions doesn’t work so easily here because there are two explanatory variables in the model. To show both, because you want to keep colour for the species, you put the other ones on the end to make facets. This chooses five representative values for a quantitative explanatory variable and gives you a facet for each. I also added sex; plot_predictions will use facet_grid rather than facet_wrap when there are two explanatory variables after the thing that is colour. The facets for the first one are in columns and for the second one are in rows. The five different values for bill_depth_mm are hidden in the facet labels, and you can’t really see them, but take it that the plot is using something like a five-number summary:
plot_predictions(penguins.2, condition = c("bill_length_mm",
"group", "sex",
"bill_depth_mm"))Warning: Removed 398 rows containing missing values or values outside the scale range
(`geom_ribbon()`).
The species are actually very distinguishable (as you saw from the plot you made for this part), so most of the predicted probabilities are close to zero or one.2 Reading these graphs down the left side (for females):
- when bill depth is small, it is a Gentoo
- when bill depth is small-to-moderate:
- when bill length is small, it is an Adelie
- when bill length is large, it is a Gentoo
- when bill depth is large:
- when bill length is small, it is an Adelie
- when bill length is large, it is a Chinstrap.
The actual definitions of “small” and “large” for bill depth depend on the value of bill length.
This is more or less the same as our graph (the one I had you make), and the pattern is almost exactly the same for males and females, except that for females, the third facet down is the same as the last two, and for males, it is the same as the second one.3
Manchester United
Manchester United is one of the most famous soccer clubs in England and indeed the world. Information about the players (at some point in the past) is here: http://datafiles.ritsokiguess.site/manu.csv (it’s a CSV file). We are going to learn something about the ages of the players.
- Read in the file and display some of the resulting data frame.
The usual thing:
my_url <- "http://datafiles.ritsokiguess.site/manu.csv"
man_united <- read_csv(my_url)
Rows: 29 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): name, nationality, date_of_birth, country_of_birth, place_of_birth,...
dbl (2): number, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
man_united
- What kind of thing is the column date_of_birth? Create a new column that contains the players’ dates of birth as actual R dates, and display the old dates of birth alongside your new column (or at least the first few rows of them). Save your updated data frame.
The dates of birth are currently text, as evidenced by the chr at the top of the column, or in the read_csv output.
To make these into proper R dates, we need to use something from lubridate. To figure out what, look at the dates as given: they are a day number, a month name and then a year, so that dmy will do the job. (It doesn’t matter that the month is a name; R has a thing called a locale that tells it what language it is working in, and it will get things like month names from there.)
Thus:
man_united %>%
mutate(date_of_birth_date = dmy(date_of_birth)) -> mu2
mu2 %>% select(starts_with("date"))
You have a lot of freedom here in terms of choices for things. I thought it was easier to save the new data frame first, then look at the appropriate columns, but if you have something else that works, it’s good.
The new column is correctly a date, and is displayed in ISO format with the year first, but you can check that, at least for the ones you see, the actual dates in both formats are the same.
Extra: the dates of birth are read in as text because they are not in the format year-month-day where everything is numbers, which is the only format that will get turned into dates automatically.
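A tiny demonstration, as a sketch (readr’s I() treats a literal string as a file; the column names iso and words are made up):
read_csv(I("iso,words\n2016-05-21,21 May 2016")) # iso is parsed as a date, words stays as text
The first column comes out as an actual date; the second stays text until you convert it yourself with something like dmy.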
- Treating the new column of dates as quantitative, make a suitable plot of these with the players’ position on the field. What do you see on the quantitative axis?
The positions are text, and if you look at the data, they are one of these:
mu2 %>% count(position)
So it makes sense to treat position as categorical, and hence to draw a boxplot. You might be curious about what will happen:
ggplot(mu2, aes(x = position, y = date_of_birth_date)) + geom_boxplot()
The \(y\)-axis is marked off in years, that is, on a date scale. (If the players had all been born in the same year, you would have seen months here as well.)
- Is there a position where the players tend to be older? Explain briefly.
The oldest players will have had the earliest birth dates, that is, they will be at the bottom of the plot. Thus, the goalkeepers tend to be older than the players in other positions.
If you know anything about soccer, you’ll know that it is generally good to have an experienced goalkeeper, someone who knows how to be in the right place so that they don’t have to move very far (or who knows what an opposing attacker is likely to do).
In this data set, the attackers are youngest, perhaps because you want someone fast playing up front.
- (This is to prepare you for the next thing.) Work out how many years old you are by using something like as.Date("2016-05-21") to turn your birth date into an R date, create a period from the interval from it to now (use today() to get today’s date), and pull out the number of (completed) years. (If you don’t want to share your birth date, use any other date. I’m not checking.) Make sure to have the right number of brackets in the right places.
Here’s the one-step solution, using the date above (which is the last-but-one date on which Manchester United won the FA Cup; the last time was in 2024):
year(as.period(as.Date("2016-05-21") %--% Sys.Date()))
[1] 9
If you don’t like that, build it up in steps, putting brackets around the appropriate things, and copying and pasting each line into the next one:
as.Date("2016-05-21")[1] "2016-05-21"
today()[1] "2026-01-22"
as.period(as.Date("2016-05-21") %--% today())[1] "9y 8m 1d 0H 0M 0S"
year(as.period(as.Date("2016-05-21") %--% today()))[1] 9
These are, respectively:
- the date of the cup final win, as a date
- “today” (when I last ran this) as a date
- the time between then and now as a period (note how it displays)
- the number of years since then. This was (at the time of writing) 9 years and some number of months ago.
I realize that this also works as a pipe:
as.Date("2016-05-21") %--% today() %>%
as.period() %>%
year()
[1] 9
but it’s better for what we’ll be doing (in a moment) to cope with all the brackets.
Extra: any other way of turning your birthday from text into a date is also fine, such as
mdy("May 21, 2016")[1] "2016-05-21"
in place of the as.Date, but the focus of this question was making a period and extracting the number of years from it, so it is not important how you turn your birthday into a date.
- Go back to the Manchester United players. Calculate a new column containing the age, in completed years, of each player as measured today (thus, for example, a player who is currently 29 years and some number of days old should be counted as 29 years old, even if the number of days is something like 364). Display your new column side by side with the one called age. (This uses the same technique that you used to calculate your own age in the previous part, except that you don’t need anything like as.Date because you converted the birth dates into R Dates in an earlier part.)
The reason for doing the previous part was to give you hints for this one.
I am just going to display my result, since I am going to add to my pipeline in a moment:
mu2 %>% mutate(
my_age = year(as.period(date_of_birth_date %--% today()))) %>%
select(age, my_age)
The calculation of what I called my_age is rather lengthy; you might need to split the line so that you see it all, or you could even define a function to do the heavy lifting so that the mutate is shorter (sketched below).
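For example, a minimal sketch of such a helper (the name age_in_years is my own):
age_in_years <- function(birth_date) {
  # completed years from birth_date up to today
  year(as.period(birth_date %--% today()))
}
mu2 %>% mutate(my_age = age_in_years(date_of_birth_date)) %>%
  select(age, my_age)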
- What does the previous question tell you about when I originally downloaded the data? (It is reasonable to assume that the ages in the original data were correct on the date I downloaded the data.)
The players’ ages as calculated by me and the players’ ages as given in the original data differ by about 6 years, by eyeballing the answer to the previous question, but we can be a bit more careful:
mu2 %>% mutate(
my_age = year(as.period(date_of_birth_date %--% today()))) %>%
select(age, my_age) %>%
mutate(age_difference = my_age - age) %>%
count(age_difference)
so the short answer is “between 6 and 7 years ago”. (Three of the players are seven years older today than they were when the data were collected.)
Extra: we can get a more precise answer about the original download date by guessing a download date, using that instead of today in the code above, and seeing how many players have the “wrong” age based on the guessed download date. I might need to make several guesses, so first I am going to write a function to display the players with the “wrong” ages and their birthdates:
find_wrong_ages <- function(guessed_download_date, mu_df) {
mu_df %>%
mutate(my_age = year(as.period(date_of_birth_date %--% guessed_download_date))) %>%
select(name, date_of_birth, age, my_age) %>%
filter(age != my_age)
}
Let’s try this with today (so that every player should have the wrong age):
find_wrong_ages(today(), mu2)
That seems to have worked. So now let’s try a date about seven years ago:
find_wrong_ages(ymd("2019-01-22"), mu2)Most of the players (26 out of 29) have a calculated age that is a year too young, so the actual download date must be later than the one we tried. Let’s try October:
find_wrong_ages(ymd("2019-10-22"), mu2)And now we’re getting closer. Note now these birthdays are between October 31 and December 8, so if we go to a date after McTominay’s birthday, we should be right:
find_wrong_ages(ymd("2019-12-09"), mu2)There are no wrong ages now, so the ages are consistent with the download having been on Dec 9. But it might actually have been a little later than that:
find_wrong_ages(ymd("2020-01-01"), mu2)It could, in fact, have been anywhere between Dec 9 and Dec 14, to make sure that Lingard’s age is correct:
find_wrong_ages(ymd("2019-12-14"), mu2)I don’t remember the exact date, but I had either just finished or was just about to start marking the C32 final exam that year, so I was putting some problems together for D29, including this one.
Watching the NBA
The NBA (National Basketball Association) runs North America’s major basketball league, whose games are played from October to April. The 2023-2024 schedule is at http://ritsokiguess.site/datafiles/nba_sched.csv. (This question came from an assignment where this was the current season.)
- Read in and display some of the data.
As you well know by now:
my_url <- "http://ritsokiguess.site/datafiles/nba_sched.csv"
nba <- read_csv(my_url)
Rows: 1200 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): date, road, home, start_eastern, TV
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba
Extra: there was originally a long story here about how I got the data into this format for you (it was a PDF file), but the original link has disappeared, so the long story has too.
- NBA games are played on different days of the week. Which day of the week has the most games, and which day of the week has the fewest? Use the tools you saw in lecture, in this course and STAC32, to work this out. (Hint: does it seem to matter that the year only has two digits here?)
Before you start coding, do some thinking:
- to get the days of the week, you can use wday.
- to use wday, you need actual dates, so you have to make those first. These are American-style dates, with the month first, so we will need mdy for this.
- once you have worked out the day of week of each game, you can use count to count them up, and arrange to put them in order.
That leads to something like this:
nba %>%
mutate(date1 = mdy(date)) %>%
mutate(dow = wday(date1, label = TRUE)) %>%
count(dow) %>%
arrange(desc(n))
Wednesday is the most common day for a game, and Thursday is the least common.
To pick up on the hint: let’s run the first mutate again:
nba %>%
mutate(date1 = mdy(date)) %>%
select(date, date1)
mdy inferred that the dates were in the 21st century, so that the two-digit years were actually all right. (It would be smart to check this before you get too far into the question.)
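One such check, as a sketch: the earliest and latest parsed dates should both land inside the 2023–2024 season:
nba %>%
  mutate(date1 = mdy(date)) %>%
  summarize(earliest = min(date1), latest = max(date1)) # both should fall in 2023-2024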
Extra: I was actually surprised by the Wednesday thing; I was expecting a weekend day to be most common. I used to follow the NHL (back in the days of Wayne Gretzky as a player, which shows you how old I am). As I remember, there were most games on a Saturday, and fewest on a Monday, and that was the sort of thing I was expecting to see here. My guess is that a lot of people who follow the NBA also follow college basketball, many of whose games are played on Saturdays.
- You have a friend who lives in Auckland, New Zealand, who is a big basketball fan. They have a streaming package that enables them to watch any NBA game live. They get home from work at 4:00pm (local time) every day. What are some games they would have been able to watch from start to finish as they happen? Use tools we have seen in lecture to find this out. Hints below:
- your friend needs to know about games that start at 4:00pm or later Auckland time (16:00 or later).
- you can use unite to glue a date and time together as text
- if your time does not have seconds, omit the s in the appropriate function
- when you create a date-time that needs to be in a certain timezone, add the timezone when you create it. America/Toronto will do for Eastern time; use OlsonNames() to get a list of all the time zone names that R knows about. (The output from OlsonNames() is long, so just find what you need and use that.)
Again, thinking first:
- we are talking time zones, so we need actual date-times with timezones.
- the unite hint above says how to make a date-time as text
- use something like ymd_hms to make an actual date-time. Here, the months are first and the years last, and there are no seconds, so use mdy_hm instead. (All these permutations exist.)
- add a timezone when you are creating the date-times (Eastern time), otherwise you’ll get the UTC timezone.
- find out using OlsonNames() that the time zone that Auckland, NZ is in is called Pacific/Auckland (down near the bottom; there is also a timezone called simply NZ that should also work. New Zealand as a whole has only one timezone.)
- use with_tz to move a date-time from one timezone to another.
I know the start of a basketball game as “tip” or “tip-off”, hence my names for the columns below. Something like this is what you need:
nba %>% unite(tip_text, date, start_eastern) %>%
mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>%
mutate(tip_nz = with_tz(tip, "Pacific/Auckland"))If your friend has the patience, they can scroll down through 1200 rows and find the ones that start at or after 16:00 in the tip_nz column. But we can do better than that: pull out the hour from tip_nz, and grab the ones that are 16 or more:
nba %>% unite(tip_text, date, start_eastern) %>%
mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>%
mutate(tip_nz = with_tz(tip, "Pacific/Auckland")) %>%
mutate(tip_nz_hour = hour(tip_nz)) %>%
filter(tip_nz_hour >= 16)
Thus, for example, Golden State at Denver and Portland at Sacramento on Nov 8, LA Lakers at Phoenix and Oklahoma City at Sacramento on Nov 10, and so on. Say something like this. However you find them is good, but the way I just did it is best. There are 132 games your friend can watch after they get home from work, some of which are at the same time (so they will have to choose which one to watch live).
Extra 1: As you page through the games, you will see that they are all on the west coast (Pacific time zone). This is because the time difference between North America and New Zealand is such that all the other, earlier, games are on while your friend is at work. (You might argue that a game that starts at 6:00am Auckland time is also one your friend might be able to watch, if they get up early enough, because it would be before they go to work.)
Extra 2: If you look at the columns I called tip and tip_nz (in the dataframe with all 1200 games), you can see that at the start, they are 17 hours apart and all of the games are too early for your friend to watch (even the ones that start at 10:00pm Eastern). Once you get into November, the time difference becomes 18 hours, because daylight savings ends in November in North America. Near the end of the dataframe, on March 10, the difference again becomes 17 hours, because daylight saving time starts again in North America, but on April 6, it changes again to 16 hours. Why? Because New Zealand’s daylight saving time ends (it is at this point fall in New Zealand).4
What that means is that the NBA games start late enough for your Auckland friend to watch only when there is an 18 hour time difference, because only then do the 10:00pm Eastern (7:00pm Pacific) games match up to 4:00pm the next day in Auckland. When the time difference is less than that, the 10:00pm Eastern games start earlier than 4:00pm in Auckland.
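If you want to check those time differences without scrolling, here is a sketch that counts the games at each whole-hour offset. Subtracting clock hours mod 24 works here because both time zones sit a whole number of hours from UTC, so the minutes cancel:
nba %>% unite(tip_text, date, start_eastern) %>%
  mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>%
  mutate(tip_nz = with_tz(tip, "Pacific/Auckland")) %>%
  mutate(hours_ahead = (hour(tip_nz) - hour(tip)) %% 24) %>% # hours Auckland is ahead
  count(hours_ahead)
You should see games at the 16, 17, and 18 hour differences described above.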
I don’t know about you, but I’m quite glad to have with_tz handle all of that for us, and get an answer for your friend that actually would have worked.
Footnotes
1. For example, add up the males and females, or add up the numbers of each species.
2. This is why I had to work so hard to have you find some predictions that were not close to 0 or 1.
3. Look carefully to see which of those two close-together traces, because of the scale, is green and which one is blue.
4. The other end of NZ’s daylight saving time is in September, before the NBA season starts. That is when it starts, because at that point it is spring in NZ.