Worksheet 4
Packages
library(tidyverse)
library(marginaleffects)
library(nnet)
# library(MASS, exclude = "select")
The first problem is a repeat from Worksheet 3, in case you didn’t do it before. The other problems are on dates and times.
Palmer penguins
The penguins dataset in the package palmerpenguins contains body measurements and other information for 344 adult penguins that were observed at the Palmer Archipelago in Antarctica. Variables of interest to us are:
- `species`: the species of the penguin (one of Adelie, Chinstrap, Gentoo)
- `bill_length_mm`: bill length in mm
- `bill_depth_mm`: bill depth (from top to bottom) in mm
- `flipper_length_mm`: flipper length in mm
- `body_mass_g`: body mass in grams
- `sex`: whether the penguin is male or female
In particular, is it possible to predict the species of a penguin from the other variables?
- Load the package (installing it first if you need to) and display the dataset.
- There are some missing values in the dataset. Remove all the observations that have missing values on any of the variables, and save the resulting dataset.
- Why is `polr` from `MASS` not appropriate for fitting a model to predict species from the other variables?
- Fit a suitable model predicting species from the other variables (the fitting process will produce some output).
- Use `step` to see whether anything can be removed from your model. Which variables remain after `step` has finished? (Hints: save the output from `step`, because this is actually the best model that `step` found. Also, the fitting process will probably produce a lot of output, which, for this question, you can include in your assignment.)
- Obtain predicted probabilities of each species from your best model for all nine combinations of: bill length 39.6, 40.0, and 40.4, bill depth 15.8, 16.0, and 16.2, sex female (only). Display those predicted probabilities in a way that allows for easy comparison of the three species probabilities for given values of the other variables.
- Using your predictions, what combinations of values for bill length and bill depth distinguish the three species?
- Make a plot that includes bill length, bill depth, species, and sex. Does your plot support your conclusions from the previous question?
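As a warm-up for the prediction step, here is a minimal sketch of getting predicted category probabilities for a grid of explanatory-variable values. It is not the penguins analysis: it uses the built-in `iris` data as a stand-in, and it uses base R's `predict` on a `multinom` fit rather than `marginaleffects`, just to keep the example small. The object names `fit`, `new`, `probs`, and `result` are made up for illustration.

```r
library(nnet)

# Stand-in data: the built-in iris data frame, NOT the penguins data.
# trace = FALSE suppresses the iteration output from the fitting process.
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

# All nine combinations of three values of each explanatory variable
new <- expand.grid(
  Sepal.Length = c(5, 6, 7),
  Sepal.Width  = c(2.5, 3, 3.5)
)

# One row per combination, one probability column per response category
probs <- predict(fit, newdata = new, type = "probs")

# Put the predictor values next to the probabilities for easy comparison
result <- cbind(new, round(probs, 3))
result
```

With one column per category, you can read across a row to see which category is most likely for that combination of values, and down a column to see how its probability changes.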
Manchester United
Manchester United is one of the most famous soccer clubs in England and indeed the world. Information about the players (at some point in the past) is here: http://datafiles.ritsokiguess.site/manu.csv (it’s a CSV file). We are going to learn something about the ages of the players.
- Read in the file and display some of the resulting data frame.
- What kind of thing is the column `date_of_birth`? Create a new column that contains the players’ dates of birth as actual R dates, and display the old dates of birth alongside your new column (or at least the first few rows of them). Save your updated data frame.
- Treating the new column of dates as quantitative, make a suitable plot of these with the players’ `position` on the field. What do you see on the quantitative axis?
- Is there a position where the players tend to be older? Explain briefly.
- (This is to prepare you for the next thing.) Work out how many years old you are by using something like `as.Date("2016-05-21")` to turn your birth date into an R date, create a `period` from the interval from it to now (use `today()` to get today’s date), and pull out the number of (completed) years. (If you don’t want to share your birth date, use any other date. I’m not checking.) Make sure to have the right number of brackets in the right places.
- Go back to the Manchester United players. Calculate a new column containing the age, in completed years, of each player as measured today (thus, for example, a player who is currently 29 years and some number of days old should be counted as 29 years old, even if the number of days is something like 364). Display your new column side by side with the one called `age`. (This uses the same technique that you used to calculate your own age in the previous part, except that you don’t need anything like `as.Date` because you converted the birth dates into R `Date`s in an earlier part.)
- What does the previous question tell you about when I originally downloaded the data? (It is reasonable to assume that the ages in the original data were correct on the date I downloaded the data.)
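The date-handling steps in this problem can be sketched on a tiny made-up data frame. Everything here is an assumption for illustration: the player names, the birth dates, the column names `dob_text`, `dob`, and `age_years`, and in particular the day-month-year text format (check what the real file looks like before choosing a parsing function). A fixed reference date is used instead of `today()` so that the result does not change from day to day.

```r
library(tidyverse)
library(lubridate)

# Made-up players and birth dates (NOT from manu.csv);
# the "day month year" text format is an assumption
players <- tribble(
  ~name,      ~dob_text,
  "Player A", "2 March 1990",
  "Player B", "1 March 1990",
  "Player C", "15 August 2001"
)

# Fixed reference date; replace with today() for "age as of now"
ref_date <- ymd("2024-03-01")

ages <- players %>%
  mutate(
    dob = dmy(dob_text),                             # text -> R Date
    age_years = year(as.period(dob %--% ref_date))   # completed years only
  )
ages
```

Player A is one day short of a 34th birthday on the reference date, so the completed-years age is 33, while Player B, born one day earlier, is 34.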
Watching the NBA
The NBA (National Basketball Association) runs North America’s major basketball league, whose games are played from October to April. The 2023-2024 schedule is at http://ritsokiguess.site/datafiles/nba_sched.csv. (This question came from an assignment where this was the current season.)
- Read in and display some of the data.
- NBA games are played on different days of the week. Which day of the week has the most games, and which day of the week has the fewest? Use the tools you saw in lecture, in this course and STAC32, to work this out. (Hint: does it seem to matter that the year only has two digits here?)
- You have a friend who lives in Auckland, New Zealand, who is a big basketball fan. They have a streaming package that enables them to watch any NBA game live. They get home from work at 4:00pm (local time) every day. What are some games they would have been able to watch from start to finish as they happen? Use tools we have seen in lecture to find this out. Hints below:
- your friend needs to know about games that start at 4:00pm or later Auckland time (16:00 or later).
- you can use `unite` to glue a date and time together as text
- if your time does not have seconds, omit the `s` in the appropriate function
- when you create a date-time that needs to be in a certain timezone, add the timezone when you create it
- `America/Toronto` will do for Eastern time; use `OlsonNames()` to get a list of all the time zone names that R knows about. (The output from `OlsonNames()` is long, so just find what you need and use that.)
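The hints above can be sketched on three made-up games. The rows, the column names (`date`, `time`, `when`, `start_et`, `start_akl`), and the text formats (month/day/two-digit-year dates, 24-hour times without seconds) are all assumptions, so adjust the parsing to whatever the real schedule file contains.

```r
library(tidyverse)
library(lubridate)

# Made-up games (NOT from nba_sched.csv); formats are assumptions
games <- tribble(
  ~date,      ~time,
  "10/24/23", "19:30",
  "1/15/24",  "22:00",
  "12/25/23", "12:00"
) %>%
  unite(when, date, time, sep = " ") %>%                     # glue date and time as text
  mutate(
    start_et  = mdy_hm(when, tz = "America/Toronto"),      # set the zone at creation
    start_akl = with_tz(start_et, "Pacific/Auckland")      # same instant, Auckland clock
  )

# Games per day of the week, using the Eastern-time date
games %>% count(day = wday(start_et, label = TRUE))

# Games that start at 16:00 or later, Auckland time
watchable <- games %>% filter(hour(start_akl) >= 16)
watchable
```

`with_tz` changes only how the moment is displayed, not the moment itself, which is why the time zone has to be right when the date-time is first created.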