Worksheet 3
Packages
library(tidyverse)
library(marginaleffects)
library(MASS, exclude = "select")
library(nnet)
Log odds and poisoning rats
In one of the examples from lecture, we learned about modelling the probability that a rat would live as it depended on the dose of a poison. Some of the output from the logistic regression is shown below:
summary(rat2.1)
Call:
glm(formula = response ~ dose, family = "binomial", data = rat2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   2.3619     0.6719   3.515 0.000439 ***
dose         -0.9448     0.2351  -4.018 5.87e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.530 on 5 degrees of freedom
Residual deviance: 2.474 on 4 degrees of freedom
AIC: 18.94
Number of Fisher Scoring iterations: 4
For the calculations below, I suggest you use R as a calculator. If you prefer, use your actual calculator, but then your numerical answers need to be sufficiently close to correct to earn credit (this was originally an assignment problem).
- Using the summary output, obtain a prediction for a dose of 3.2 units. What precisely is this a prediction of?
- Convert your prediction into a predicted probability that a rat given this dose will live. Hint: if probability is denoted \(p\) and odds \(d\), we saw in class that \(d = p/(1-p)\). It follows (by algebra that I am doing for you) that \(p = d/(1+d)\).
- In the output given at the top of this question, there is a number \(-0.9448\). What is the interpretation of this number? (If you prefer, you can instead interpret the exp of this number.)
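As a hedged illustration of using R as a calculator for the parts above, the sketch below plugs the two coefficients from the summary output into the linear predictor and then converts the result. The variable names are my own, not anything from the lecture, and which outcome counts as a "success" depends on how response was coded when the model was fitted (the question text says it is living).

```r
# Coefficients copied from the summary output above
intercept <- 2.3619
slope <- -0.9448
dose <- 3.2

# The linear predictor is on the log-odds scale
log_odds <- intercept + slope * dose

# Undo the log to get odds d, then use p = d / (1 + d)
odds <- exp(log_odds)
prob <- odds / (1 + odds)

# plogis() does the log-odds-to-probability conversion in one step
prob_check <- plogis(log_odds)

# exp(slope): the factor by which the odds are multiplied
# when dose increases by one unit
odds_ratio <- exp(slope)

c(log_odds = log_odds, odds = odds, prob = prob, odds_ratio = odds_ratio)
```

Because the slope is negative, exp(slope) is less than 1, so the odds shrink as the dose goes up.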
Carrots
In a consumer study, 103 consumers scored their preference of 12 Danish carrot types on a scale from 1 to 7, where 1 represents “strongly dislike” and 7 represents “strongly like”. The consumers also rated each carrot type on some other features, and some demographic information was collected. The data are in http://ritsokiguess.site/datafiles/carrots_pref.csv. We will be predicting preference score from the type of carrot and how often the consumer eats carrots (the latter treated as quantitative):
- Frequency: how often the consumer eats carrots: 1: once a week or more, 2: once every two weeks, 3: once every three weeks, 4: at least once a month, 5: less than once a month. (We will treat this as quantitative.)
- Preference: consumer score on a seven-point scale, 7 being best
- Product: type of carrot (there are 12 different named types).
- Read in and display (some of) the data.
- Why would ordinal logistic regression be a sensible method of analysis here?
- Fit an ordinal logistic regression to this dataset. You do not need to display any output from this model yet. Hint: Preference is actually categorical, even though it looks like a number, so you should make sure that R treats it as categorical.
- Can any explanatory variables be removed? Explain briefly.
- If necessary, fit an improved model. (If not, explain briefly why not.)
- We will be predicting probabilities of each rating category for each of the explanatory variables remaining in the best model. Make a dataframe that includes all the different types of carrot, and the values 1 and 5 for eat_carrots if that is in your best model. Hint: you can use count to get all the levels of a categorical variable.
- Predict the probability of a customer giving each carrot type each preference score. Display your results in such a way that you can easily compare the probability of each score for different types of carrot.
- There was a significant difference in preference scores among the different types of carrot. What do your predictions tell you about why that is? Explain briefly.
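The general shape of the workflow in this question can be sketched on made-up data. Everything below is an assumption for illustration: the data frame toy and its columns product, freq, and score are simulated stand-ins, not the carrot data, and I use base predict rather than any particular helper from lecture. The point is the sequence: ordered-factor response, polr fit, drop1 to test each term, refit, then predicted probabilities laid out one row per product.

```r
library(MASS)

set.seed(1)
n <- 300
# Simulated stand-in data (NOT the carrot data)
toy <- data.frame(
  product = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  freq = sample(1:5, n, replace = TRUE)
)
# Latent liking: depends on product, not on freq
latent <- c(A = 0, B = 1, C = 2)[as.character(toy$product)] + rlogis(n)
# Cut into four ordered categories, stored as an ordered factor
toy$score <- cut(latent, breaks = c(-Inf, 0, 1, 2, Inf),
                 labels = 1:4, ordered_result = TRUE)

# Ordinal logistic regression with both explanatory variables
fit_1 <- polr(score ~ product + freq, data = toy)

# Test whether each term can be dropped
drop1(fit_1, test = "Chisq")

# Refit without freq (in this toy setup it has no real effect)
fit_2 <- update(fit_1, . ~ . - freq)

# One row per product; columns are the probabilities of each score
new <- data.frame(product = factor(levels(toy$product)))
probs <- predict(fit_2, newdata = new, type = "probs")
cbind(new, round(probs, 3))
```

Putting the score categories in columns means you can read across a row to see where one product's probability is concentrated, and down a column to compare products.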
Palmer penguins
The penguins dataset in the package palmerpenguins contains body measurements and other information for 344 adult penguins that were observed at the Palmer Archipelago in Antarctica. Variables of interest to us are:
- species: the species of the penguin (one of Adelie, Chinstrap, Gentoo)
- bill_length_mm: bill length in mm
- bill_depth_mm: bill depth (from top to bottom) in mm
- flipper_length_mm: flipper length in mm
- body_mass_g: body mass in grams
- sex: whether the penguin is male or female
In particular, is it possible to predict the species of a penguin from the other variables?
- (1 point) Load the package and display the dataset.
- (2 points) There are some missing values in the dataset. Remove all the observations that have missing values on any of the variables, and save the resulting dataset.
- (2 points) Why is polr from MASS not appropriate for fitting a model to predict species from the other variables?
- (2 points) Fit a suitable model predicting species from the other variables. There is no need to display any of the output from the model at this point (the fitting process will produce some output, which is fine to include).
- (3 points) Use step to see whether anything can be removed from your model. Which variables remain after step has finished? (Hints: save the output from step, because this is actually the best model that step found. Also, the fitting process will probably produce a lot of output, which, for this question, you can include in your assignment.)
- (3 points) Obtain predicted probabilities of each species from your best model for all nine combinations of: bill length 39.6, 40.0, and 40.4, bill depth 15.8, 16.0, and 16.2, sex female (only). Display those predicted probabilities in a way that allows for easy comparison of the three species probabilities for given values of the other variables.
- (3 points) Using your predictions, what combinations of values for bill length and bill depth distinguish the three species?
- (3 points) Make a plot that includes bill length, bill depth, species, and sex. Does your plot support your conclusions from the previous question?
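Here is a minimal sketch of the same sequence of steps on simulated data, assuming an unordered three-category response. The data frame toy and its columns group, x1, x2, and x3 are invented for illustration and are not the penguins. It shows why removing missing values first matters (step needs every candidate model fitted to the same rows), how to save the model that step returns, and how to predict for all combinations of some chosen values.

```r
library(nnet)

set.seed(2)
n <- 300
# Simulated stand-in data (NOT the penguins): three unordered groups
toy <- data.frame(group = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
# x1 and x2 have group-specific means; x3 is pure noise
toy$x1 <- rnorm(n, mean = c(a = 0, b = 2, c = 2)[as.character(toy$group)])
toy$x2 <- rnorm(n, mean = c(a = 0, b = 0, c = 2)[as.character(toy$group)])
toy$x3 <- rnorm(n)
toy$x3[1:5] <- NA  # a few missing values

# Drop incomplete rows first, so that every model step() tries uses the same rows
toy_complete <- na.omit(toy)

# Multinomial logistic regression; trace = FALSE hides the iteration log
fit_all <- multinom(group ~ x1 + x2 + x3, data = toy_complete, trace = FALSE)

# step() returns the best model it found, so save it
fit_best <- step(fit_all, trace = 0)
formula(fit_best)

# All combinations of the chosen values. x3 is included in case step() kept it;
# columns the model does not use are ignored by predict().
new <- expand.grid(x1 = c(0, 2), x2 = c(0, 2), x3 = 0)
probs <- predict(fit_best, newdata = new, type = "probs")
cbind(new, round(probs, 3))
```

In the sketch I silenced the output with trace = FALSE and trace = 0 to keep it short; leaving those at their defaults gives the longer output the question says is fine to include.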
Choice-box
A psychology experiment began by showing a video in which four German children demonstrated how to use a device called a “choice-box”, which consisted of three pipes. Three of the children in the video used pipe #1, demonstrating how to throw a ball into the pipe and receive a toy from a dispenser. The other child in the video used pipe #2, also throwing a ball into the pipe and receiving a toy from the dispenser. Pipe #3 was never used on the video.
The pipes on the choice-box were actually different colours, and different versions of the video were used in which the identity of pipes #1, #2, and #3 was varied at random, and the order of children using pipes #1 and #2 on the video was also varied at random: sometimes the three children demonstrating the same pipe appeared first, and sometimes the one child demonstrating the other pipe appeared first.
The 629 subjects of the experiment, who were other children of various ages, were each given one ball to use in the choice-box. The experimenter noted which pipe each subject threw the ball into, and how it related to the pipes used in the video that subject had watched. These are in the column y:
- majority: the subject threw their ball into the pipe demonstrated by three children on their video (what I called pipe #1).
- minority: the subject threw their ball into the pipe demonstrated by only one child on their video (what I called pipe #2).
- unchosen: the subject threw their ball into the pipe demonstrated by none of the children on their video (what I called pipe #3). I should probably point out that these subjects got a toy from the dispenser as well.
The aim of the experiment was to see whether the subjects were influenced by what happened on the video they saw: for example, was a subject more likely to choose the pipe demonstrated three times on their video? The experimenters also recorded the gender, age, and culture of each subject (coded as C1 through C8), along with whether the video showed three children using pipe #1 first, or one child using pipe #2 first. Did these other variables have an effect on which pipe a subject chose? This kind of experiment might shed some light on how children are influenced by what they see, and on what changes that influence.
The data are in http://ritsokiguess.site/datafiles/Boxes.csv.
- Read in and display (some of) the data.
- What assumption is made about the response categories in order to use multinom from package nnet?
- Fit an appropriate model for predicting the (treated as unordered) category of y from the other variables. Include a squared term in age. You don’t need to display any results.
- To find out what, if anything, you can remove from your model, use step. The input to step is a model (here, the one you fitted in the previous part). The output from step is another model, the one obtained by removing everything that can be removed. Save this model. Running step displays some additional output, showing you what it is doing. (You might find that there is a lot of additional output; that was fine to hand in on this assignment.)
- For your best model, create a dataframe for predicting the probability that a child will choose the majority, minority, or unchosen pipe, for ages 5 through 13. What values have been used for the other explanatory variables?
- Calculate and display your predictions side by side with the corresponding explanatory variable values. Arrange your predictions in a way that makes them easier to compare.
- Plot the predictions as they depend on age. Hint: use the simplified procedure shown in lecture (which should also be in the slides).
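Two details in this question are easy to get wrong: writing a squared term in a model formula, and plotting predicted probabilities for several categories against age. The sketch below illustrates both on simulated data. The data frame toy is made up (it is not the choice-box data and has no gender, culture, or video-order columns), and the base-graphics plot is just one way to draw the picture, not necessarily the simplified procedure from lecture.

```r
library(nnet)

set.seed(3)
n <- 400
# Simulated stand-in data (NOT the choice-box data)
toy <- data.frame(age = runif(n, 5, 13))
# Probability of "majority" rises and then falls with age (a curved relationship)
p_major <- plogis(1 - 0.15 * (toy$age - 9)^2)
u <- runif(n)
toy$y <- factor(ifelse(u < p_major, "majority",
                ifelse(u < p_major + (1 - p_major) / 2, "minority", "unchosen")))

# I(age^2) adds the squared term; a bare age^2 in a formula
# is read as formula syntax and would just mean age
fit <- multinom(y ~ age + I(age^2), data = toy, trace = FALSE)

# Predictions for ages 5 through 13, side by side with the ages
new <- data.frame(age = 5:13)
probs <- predict(fit, newdata = new, type = "probs")
cbind(new, round(probs, 3))

# One line per response category, plotted against age
matplot(new$age, probs, type = "l", lty = 1, col = 1:3,
        xlab = "age", ylab = "predicted probability")
legend("topright", legend = colnames(probs), lty = 1, col = 1:3)
```

With real data, the dataframe of new values also needs a value for every other explanatory variable left in the model, which is what the question about "other explanatory variables" is getting at.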