Worksheet 3
Packages
library(tidyverse)
library(marginaleffects)
library(MASS, exclude = "select")
library(nnet)
Log odds and poisoning rats
In one of the examples from lecture, we learned about modelling the probability that a rat would live as it depended on the dose of a poison. Some of the output from the logistic regression is as shown below:
summary(rat2.1)
Call:
glm(formula = response ~ dose, family = "binomial", data = rat2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3619 0.6719 3.515 0.000439 ***
dose -0.9448 0.2351 -4.018 5.87e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.530 on 5 degrees of freedom
Residual deviance: 2.474 on 4 degrees of freedom
AIC: 18.94
Number of Fisher Scoring iterations: 4
For the calculations below, I suggest you use R as a calculator. If you prefer, use your actual calculator, but then your numerical answers would have needed to be sufficiently close to correct to get any credit (when this was an assignment problem).
- Using the summary output, obtain a prediction for a dose of 3.2 units. What precisely is this a prediction of?
Do it as if you are doing a regression prediction, B22-style, by substituting 3.2 into the (logistic) regression equation:
log_odds <- 2.3619 - 0.9448 * 3.2
log_odds
[1] -0.66146
This, precisely, is the predicted log-odds that a rat given this dose will live.
I don’t know about you, but I have no intuition for how big a probability this is, yet. All I can really say is that it is negative, so it goes with a probability less than 0.5. (It is a high dose by the standards of the data, so it is more likely than not that a rat given this dose will die.)
(I saved my log-odds to use in the next step, but for the purposes of this problem, I have no objection if you copy and paste the answer.)
- Convert your prediction into a predicted probability that a rat given this dose will live. Hint: if probability is denoted \(p\) and odds \(d\), we saw in class that \(d = p/(1-p)\). It follows (by algebra that I am doing for you) that \(p = d/(1+d)\).
Two steps: convert the log-odds into odds by exp-ing it:
odds <- exp(log_odds)
odds
[1] 0.5160973
and then use the formula in the hint to convert this to a probability:
odds / (1 + odds)
[1] 0.3404117
The probability that a rat given this dose will live is only 0.34.
Extra: to see whether this is right, run the code in the lecture notes (up to the model I called rat2.1) and then use predictions:
new <- tibble(dose = 3.2)
cbind(predictions(rat2.1, newdata = new)) %>%
select(dose, estimate)
Check, to within rounding.¹
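Another quick check: base R’s plogis function is the inverse of the log-odds (logit) transformation, so it converts the saved log_odds into a probability in one step:
plogis(log_odds)
[1] 0.3404117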
- In the output given at the top of this question, there is a number \(-0.9448\). What is the interpretation of this number? (If you prefer, you can instead interpret the exp of this number.)
This is the slope² of the logistic regression.
Two choices, therefore:
- the original slope -0.9448 says that if you increase dose by 1, you decrease the log-odds of living by 0.9448.
- the exp of the slope (see below) is about 0.39, so if you increase dose by 1, you multiply the odds of living by 0.39.
exp(-0.9448)
[1] 0.3887573
Increasing the dose makes the odds, and therefore the probability, of living smaller, which makes sense given that we are talking about the dose of a poison.
Extra: Interpreting the slope is talking about changes, and this doesn’t translate naturally into changes in probabilities. I wanted to do a numerical example to show you what I mean, since I don’t know how well this came across in lecture. Let’s pick several doses, 1 unit apart:
new <- tibble(dose = 1:6)
cbind(predictions(rat2.1, newdata = new)) %>%
select(estimate, dose)
The thing about this is that the change in probability as you increase dose by 1 is not constant: sometimes it’s as big as 0.23 and sometimes as small as 0.05.
Compare that with this:
cbind(predictions(rat2.1, newdata = new, type = "link")) %>%
select(estimate, dose)
Using type = "link" produces predictions on the log-odds scale.³ This time, you see that the estimate values decrease by about 0.95 every time, no matter what dose you are looking at. This “decrease of about 0.95” is actually the \(-0.9448\) that is the slope. Thus, increasing the dose by 1 decreases the log-odds of living by 0.9448 no matter what dose you’re looking at.
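If you want R to confirm that the decrease really is constant, here is a small sketch using dplyr’s lag to subtract each prediction from the one before it (this reuses the rat2.1 and new from just above):
cbind(predictions(rat2.1, newdata = new, type = "link")) %>%
  select(dose, estimate) %>%
  mutate(change = estimate - lag(estimate))
The change column should be about \(-0.945\) in every row except the first (which has no previous value to subtract).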
As I said, I don’t have much feeling for log-odds, but maybe I do have some intuition for odds themselves. Let’s turn those last predictions into odds by exp-ing them:
cbind(predictions(rat2.1, newdata = new, type = "link")) %>%
select(estimate, dose) %>%
mutate(estimate = exp(estimate))
At dose 1, a rat is odds-on to live (more likely than not), but as the dose increases, a rat definitely becomes less likely to live.
As you look down the numbers in the estimate column now, you might be able to see that each one is a bit more than a third of the previous one. The multiplier is actually this:
exp(-0.9448)
[1] 0.3887573
so maybe “a multiplier of 0.4” would be more accurate. This says that if you increase the dose by 1, you multiply the odds of living by about 0.4, no matter what dose you were looking at.
Thus:
- the slope tells you how much the log-odds changes as the \(x\) increases by 1
- the exp of the slope tells you by what multiple the odds change as the \(x\) changes by 1⁴
- neither of those tells you how the probability itself changes, because the probability doesn’t change in a predictable way.
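Extra: if you nonetheless want a rate of change on the probability scale, the marginaleffects package will compute one for you, but it is different at every dose. A sketch, reusing rat2.1 and the new with doses 1 through 6:
cbind(slopes(rat2.1, newdata = new)) %>%
  select(dose, estimate)
Here estimate is the instantaneous change in the probability of living per unit of dose, evaluated at each dose; the fact that it varies from row to row is exactly why the slope is not interpreted on the probability scale.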
Carrots
In a consumer study, 103 consumers scored their preference of 12 Danish carrot types on a scale from 1 to 7, where 1 represents “strongly dislike” and 7 represents “strongly like”. The consumers also rated each carrot type on some other features, and some demographic information was collected. The data are in http://ritsokiguess.site/datafiles/carrots_pref.csv. We will be predicting preference score from the type of carrot and how often the consumer eats carrots (the latter treated as quantitative):
- Frequency: how often the consumer eats carrots: 1: once a week or more, 2: once every two weeks, 3: once every three weeks, 4: at least once a month, 5: less than once a month. (We will treat this as quantitative.)
- Preference: consumer score on a seven-point scale, 7 being best
- Product: type of carrot (there are 12 different named types).
- Read in and display (some of) the data.
Before I do that, the packages you’ll need this time are probably these:
library(tidyverse)
library(MASS, exclude = "select")
library(marginaleffects)
# library(conflicted)
# conflicts_prefer(dplyr::select)
# conflicts_prefer(dplyr::filter)
You’ll certainly need the top three (as usual, for model-fitting, and for predictions, respectively). Because we are using MASS and there is the select issue, you need to do one of two things: either explicitly not import the select that lives in MASS (the exclude thing), or, if you forget that, load the conflicted package (the commented-out lines at the bottom).
If you’re using conflicted (which you may have to install first), you will at first get what seem like a ton of errors, because R will throw an error every time you use a function like select or filter that exists in more than one of the packages you have loaded. The error message will tell you how to choose the one you want; that’s what those last two lines are for. The select and filter we will be using are the tidyverse ones, which live in a part of the tidyverse called dplyr, so copy the code from the error messages with conflicts_prefer and dplyr in it, and paste it in somewhere so that it will run next time (my preferred place is after all the library lines where I load my packages). The next time R runs into a select, it will know which one you want, and won’t throw an error about it.
If you nonetheless get an error message about select but your use of select is correct (in tidyverse terms), that means that R is getting confused about which select you mean, and you might need to restart your R session (Session, Restart R) to keep things in line.
The coder in you might be happier (in general) to use the conflicted package, even though it takes some more work to get things set up, because then it’s absolutely clear which package each function you are running is coming from. (The business with filter has nothing to do with MASS: it so happens that dplyr (in the tidyverse) has the filter that we learned about in C32, but there is also a filter in the stats package that is loaded every time you run R,⁵ and so conflicted wants to know which of those two versions of filter is the one you mean: for us, the one in dplyr.)
Reading the data, though, is as usual:
my_url <- "http://ritsokiguess.site/datafiles/carrots_pref.csv"
carrots <- read_csv(my_url)
Rows: 1236 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Product
dbl (2): eat_carrots, Preference
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
carrots
You might be wondering why there are so many rows. This is because there were 103 consumers, and each one rated 12 different carrot types (the ones named in Product), so there should be this many rows in the “longer” layout that we have:
103 * 12
[1] 1236
and so there are. (This would admittedly have been clearer if we had had an ID for each consumer; then, it would have been easier to see where the preferences for one consumer ended and the ones for the next one began.)
- Why would ordinal logistic regression be a sensible method of analysis here?
The response variable Preference is categorical (even though written as a number), and the categories are in order, with 1 being worst and 7 being best.
- Fit an ordinal logistic regression to this dataset. You do not need to display any output from this model yet. Hint: Preference is actually categorical, even though it looks like a number, so you should make sure that R treats it as categorical.
polr from package MASS. Also, Preference, which looks like a number, should be a genuine factor, so make it one when you fit the model (or, create a column using mutate that is the factor version of Preference).
carrots.1 <- polr(factor(Preference) ~ eat_carrots + Product, data = carrots)
- Can any explanatory variables be removed? Explain briefly.
drop1 works just fine for these models, so is the right tool for finding this out:
drop1(carrots.1, test = "Chisq")
The P-value for eat_carrots is 0.095 (watch the scientific notation!), so this can be removed. This means that people’s rating for a carrot is not affected by how much they tend to eat carrots.
Note that the categorical Product is treated as a whole (which is a reason to use drop1): we know that Product has to stay, that is, that there are some differences in preference among the carrot types.
In the previous part, you didn’t know how you were going to be displaying the model, so it was better to wait and figure out what you needed to look at first.
- If necessary, fit an improved model. (If not, explain briefly why not.)
Copy, paste, and edit, or use update:
carrots.2 <- update(carrots.1, . ~ . - eat_carrots)
If you didn’t think there was anything to remove, say why you thought that (for example, that you thought both explanatory variables were significant and so should not be removed). I am likely to disagree with your reasoning here, but as long as you are consistent the rest of the way, this part would have been the only place you lost points.
- We will be predicting probabilities of each rating category for each of the explanatory variables remaining in the best model. Make a dataframe that includes all the different types of carrot, and the values 1 and 5 for eat_carrots if that is in your best model. Hint: you can use count to get all the levels of a categorical variable.
My best model does not include eat_carrots, so I only need the twelve types of carrot, which is a categorical variable.
carrots %>% count(Product) -> new
new
If you have eat_carrots in your model as well, you can still use count to get all the combinations of values, and then you need to get rid of the ones you don’t want:
carrots %>% count(eat_carrots, Product)
carrots %>% count(eat_carrots, Product) %>%
filter(eat_carrots == 1 | eat_carrots == 5) -> new2
new2
The extra column n can be safely ignored, because if predictions finds an extra column in newdata it will ignore it.
There are 12 varieties of carrot, so if that is all that is left in your model, your new should have 12 rows. If you have two values of eat_carrots going with each of those, your new will have 24 rows.
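Extra: if the n column bothers you, a sketch of an alternative: dplyr’s distinct keeps one row for each different value and adds no extra columns:
carrots %>% distinct(Product) -> new
new
Either way, all that predictions needs is a column called Product containing the twelve carrot types.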
- Predict the probability of a customer giving each carrot type each preference score. Display your results in such a way that you can easily compare the probability of each score for different types of carrot.
This is predictions, with the usual request to keep only the relevant columns:
cbind(predictions(model = carrots.2, newdata = new)) %>%
select(Product, estimate, group)
Re-fitting to get Hessian
These are the predictions all right, but 84 rows is a lot to assess. Two points out of three (when this was graded) for getting this far.
group is the score category, and I think it would be easier to have the predicted probabilities for the different groups going across the page. (Feel free to disagree, but you need to make the case for leaving it like this, if that’s what you choose to do.)
cbind(predictions(model = carrots.2, newdata = new)) %>%
select(Product, estimate, group) %>%
pivot_wider(names_from = group, values_from = estimate)
Re-fitting to get Hessian
Depending on the width of your page, you might find that the predicted probabilities now go off the right side. This was acceptable when this was an assignment, but the very best answer would round the probabilities off to, say, 2 or 3 decimals so that you can see all seven of them⁶ even on a narrow screen. You can do that all at once like this:
cbind(predictions(model = carrots.2, newdata = new)) %>%
select(Product, estimate, group) %>%
pivot_wider(names_from = group, values_from = estimate) %>%
mutate(across(where(is.numeric), \(x) round(x, 2)))
Re-fitting to get Hessian
mutate with across redefines all the columns “in place” (so that the unrounded values are lost unless you calculate them again, but we only wanted to see the rounded ones).
The last point was for making a decent attempt to show the probabilities of each preference value for each carrot in a readable way.
If your model had eat_carrots in it, your answers will look different, but the same considerations otherwise apply:
cbind(predictions(model = carrots.1, newdata = new2)) %>%
select(eat_carrots, Product, estimate, group) %>%
pivot_wider(names_from = group, values_from = estimate) %>%
mutate(across(where(is.numeric), \(x) round(x, 2)))
Re-fitting to get Hessian
- There was a significant difference in preference scores among the different types of carrot. What do your predictions tell you about why that is? Explain briefly.
See whether there are any carrots that are generally liked more or less than the others.
Casting my eye down the predictions, it seems that most Products are most likely to be rated 5. Exceptions are Major_E, Turbo_E, and Yukon_E, which are more likely to be rated 4, and are thus liked less than the others. You might also notice that Bolero_L and Yukon_L are the most likely to be rated 6 or 7, so people like them the most.
Pick out at least one product that is liked more or less than the others, to justify the significance of Product. Also, say something about whether the products you named are liked more or less overall than the others. (Because the response category is ordered, having a higher probability of being a 6 or 7 means that people like it more, and having a higher probability of a 4 or even a 3 means that people like it less. Very few people gave a 1 or a 2 rating to anything.)
If you had eat_carrots in your predictions, focus on just one of its values. The overall pattern should otherwise be the same as above.
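Extra: instead of casting your eye down the rows, you can have R find the most likely rating for each product. A sketch using dplyr’s group_by and slice_max on the long-format predictions (that is, before pivoting wider):
cbind(predictions(model = carrots.2, newdata = new)) %>%
  select(Product, estimate, group) %>%
  group_by(Product) %>%
  slice_max(estimate, n = 1)
Each Product now appears once, paired with its most probable rating in group, which makes exceptions like Major_E, Turbo_E, and Yukon_E easy to spot.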
Palmer penguins
The penguins dataset in the package palmerpenguins contains body measurements and other information for 344 adult penguins that were observed at the Palmer Archipelago in Antarctica. Variables of interest to us are:
- species: the species of the penguin (one of Adelie, Chinstrap, Gentoo)
- bill_length_mm: bill length in mm
- bill_depth_mm: bill depth (from top to bottom) in mm
- flipper_length_mm: flipper length in mm
- body_mass_g: body mass in grams
- sex: whether the penguin is male or female
In particular, is it possible to predict the species of a penguin from the other variables?
- Load the package (installing it first if you need to) and display the dataset.
That means this:
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
penguins
- There are some missing values in the dataset. Remove all the observations that have missing values on any of the variables, and save the resulting dataset.
This is what drop_na does. You can either feed it some columns to check for missings in, or if you feed it no columns at all, it will check for missings everywhere. I saved it back into the original dataset, but you can certainly create a new one if you prefer:
penguins %>% drop_na() -> penguins
If you have another way that is consistent with the way we have learned things, that is also good. (Ideas from base R are not consistent with our approach.)
Extra: to convince myself that all the missing values are indeed gone:
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm
Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60
Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30
Mean :43.99 Mean :17.16
3rd Qu.:48.60 3rd Qu.:18.70
Max. :59.60 Max. :21.50
flipper_length_mm body_mass_g sex year
Min. :172 Min. :2700 female:165 Min. :2007
1st Qu.:190 1st Qu.:3550 male :168 1st Qu.:2007
Median :197 Median :4050 Median :2008
Mean :201 Mean :4207 Mean :2008
3rd Qu.:213 3rd Qu.:4775 3rd Qu.:2009
Max. :231 Max. :6300 Max. :2009
and indeed, none of the columns have any missing values remaining in them. There appear to be 333 penguins left,⁷ so we had to get rid of 11 of them.
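A more direct check, if you prefer: count the rows that still have a missing value in any column, using dplyr’s if_any (the answer should be zero):
penguins %>% filter(if_any(everything(), is.na)) %>% nrow()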
- Why is polr from MASS not appropriate for fitting a model to predict species from the other variables?
polr assumes that the categorical response variable species has ordered categories, but in this case the species are just names or labels, and there is no (unambiguous) way to put them in order (such as from “low” to “high”).
Hence, looking just ahead, we will need to use multinom from the nnet package to fit our models. Make sure you have nnet installed.
- Fit a suitable model predicting species from the other variables (the fitting process will produce some output).
From the previous question, we have ruled out polr, so:
penguins.1 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex, data = penguins)
# weights: 21 (12 variable)
initial value 365.837892
iter 10 value 16.719027
iter 20 value 3.047829
iter 30 value 0.103484
iter 40 value 0.063985
iter 50 value 0.006201
iter 60 value 0.003924
iter 70 value 0.001733
iter 80 value 0.001547
iter 90 value 0.001513
iter 100 value 0.001498
final value 0.001498
stopped after 100 iterations
Extra: should you be curious:
summary(penguins.1)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm +
flipper_length_mm + body_mass_g + sex, data = penguins)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap -103.422229 16.51642 -24.41758 -0.31003929
Gentoo -6.756321 12.21924 -30.09757 -0.08758996
body_mass_g sexmale
Chinstrap -0.030200928 -15.820607
Gentoo 0.005019684 -8.132101
Std. Errors:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap 0.5715154 27.749774 27.43640 4.097365
Gentoo 0.5715734 8.968787 33.25608 11.905692
body_mass_g sexmale
Chinstrap 0.1350169 3.077521
Gentoo 0.5848602 3.498288
Residual Deviance: 0.002996934
AIC: 24.003
This doesn’t really tell us very much. The coefficients are kind of slopes (treating the first category Adelie as baseline), and there are some log-odds hiding in the background, but for knowing things of actual interest to us, like “can we remove anything from the model?”, this is not what to look at.
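If you really wanted P-values for individual coefficients, a standard sketch (not needed for this question) is to build Wald z-statistics from the coefficients and standard errors stored in the summary, and compare them to a normal distribution:
s <- summary(penguins.1)
z <- s$coefficients / s$standard.errors
2 * pnorm(-abs(z))
These test one coefficient at a time, though, which is still not the right tool for deciding whether a whole variable can be removed; that is what step (coming up) is for.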
- Use step to see whether anything can be removed from your model. Which variables remain after step has finished? (Hints: save the output from step, because this is actually the best model that step found. Also, the fitting process will probably produce a lot of output, which, for this question, you can include in your assignment.)
First off, you are probably thinking drop1, but remember that this doesn’t work for these models (no, I don’t know why not either):
drop1(penguins.1, test = "Chisq")
trying - bill_length_mm
Error in if (trace) {: argument is not interpretable as logical
so step it is:
penguins.2 <- step(penguins.1)
Start: AIC=24
species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + sex
trying - bill_length_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 129.185715
iter 20 value 114.018556
iter 30 value 113.336460
iter 40 value 113.216108
final value 113.215886
converged
trying - bill_depth_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 66.463118
iter 20 value 12.514139
iter 30 value 7.743531
iter 40 value 5.662963
iter 50 value 4.350499
iter 60 value 4.145847
iter 70 value 4.093510
iter 80 value 3.984724
iter 90 value 3.903126
iter 100 value 3.831010
final value 3.831010
stopped after 100 iterations
trying - flipper_length_mm
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 25.561758
iter 20 value 1.350538
iter 30 value 0.065803
iter 40 value 0.020499
iter 50 value 0.016911
iter 60 value 0.012712
iter 70 value 0.005348
iter 80 value 0.001290
iter 90 value 0.000225
iter 100 value 0.000216
final value 0.000216
stopped after 100 iterations
trying - body_mass_g
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 9.579038
iter 20 value 4.114389
iter 30 value 2.340467
iter 40 value 0.614487
iter 50 value 0.055642
iter 60 value 0.012303
iter 70 value 0.007327
iter 80 value 0.007163
iter 90 value 0.007015
iter 100 value 0.006995
final value 0.006995
stopped after 100 iterations
trying - sex
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 16.321214
iter 20 value 3.754897
iter 30 value 1.631859
iter 40 value 0.012427
iter 50 value 0.001125
iter 60 value 0.001108
iter 70 value 0.001006
iter 80 value 0.000906
iter 90 value 0.000498
iter 100 value 0.000498
final value 0.000498
stopped after 100 iterations
Df AIC
- flipper_length_mm 10 20.00043
- sex 10 20.00100
- body_mass_g 10 20.01399
<none> 12 24.00300
- bill_depth_mm 10 27.66202
- bill_length_mm 10 246.43177
# weights: 18 (10 variable)
initial value 365.837892
iter 10 value 25.561758
iter 20 value 1.350538
iter 30 value 0.065803
iter 40 value 0.020499
iter 50 value 0.016911
iter 60 value 0.012712
iter 70 value 0.005348
iter 80 value 0.001290
iter 90 value 0.000225
iter 100 value 0.000216
final value 0.000216
stopped after 100 iterations
Step: AIC=20
species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
trying - bill_length_mm
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 136.772749
iter 20 value 133.782638
iter 30 value 133.545483
final value 133.538635
converged
trying - bill_depth_mm
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 30.179445
iter 20 value 5.699489
iter 30 value 5.175455
iter 40 value 5.153196
iter 50 value 5.070255
iter 60 value 4.997789
iter 70 value 4.893410
iter 80 value 4.826209
iter 90 value 4.708542
iter 100 value 4.695168
final value 4.695168
stopped after 100 iterations
trying - body_mass_g
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 11.953428
iter 20 value 3.975600
iter 30 value 2.172365
iter 40 value 2.111553
iter 50 value 1.881400
iter 60 value 1.236767
iter 70 value 1.133364
iter 80 value 1.059253
iter 90 value 0.588576
iter 100 value 0.490499
final value 0.490499
stopped after 100 iterations
trying - sex
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 20.098944
iter 20 value 3.621850
iter 30 value 1.160548
iter 40 value 1.108738
iter 50 value 1.063387
iter 60 value 0.958674
iter 70 value 0.926099
iter 80 value 0.837719
iter 90 value 0.797644
iter 100 value 0.741759
final value 0.741759
stopped after 100 iterations
Df AIC
- body_mass_g 8 16.98100
- sex 8 17.48352
<none> 10 20.00043
- bill_depth_mm 8 25.39034
- bill_length_mm 8 283.07727
# weights: 15 (8 variable)
initial value 365.837892
iter 10 value 11.953428
iter 20 value 3.975600
iter 30 value 2.172365
iter 40 value 2.111553
iter 50 value 1.881400
iter 60 value 1.236767
iter 70 value 1.133364
iter 80 value 1.059253
iter 90 value 0.588576
iter 100 value 0.490499
final value 0.490499
stopped after 100 iterations
Step: AIC=16.98
species ~ bill_length_mm + bill_depth_mm + sex
trying - bill_length_mm
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 150.109767
iter 20 value 142.419807
iter 30 value 142.047323
final value 142.041293
converged
trying - bill_depth_mm
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 139.953898
iter 20 value 133.697479
iter 30 value 133.399000
iter 40 value 133.326376
final value 133.325373
converged
trying - sex
# weights: 12 (6 variable)
initial value 365.837892
iter 10 value 26.650783
iter 20 value 23.943597
iter 30 value 23.916873
iter 40 value 23.901339
iter 50 value 23.895442
iter 60 value 23.894251
final value 23.892065
converged
Df AIC
<none> 8 16.98100
- sex 6 59.78413
- bill_depth_mm 6 278.65075
- bill_length_mm 6 296.08259
This is where you get rather a lot of output. step has to check the effect of removing each explanatory variable from each model it fits, and multinom itself produces half a dozen or so lines of output each time it is run.
To see which variables are left, either read them off from the output here, or look at summary of your new model just enough to see which variables it has in it:
summary(penguins.2)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm +
sex, data = penguins)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm sexmale
Chinstrap -242.9740 14.88938 -21.91411 -36.9051125
Gentoo 121.0476 14.80645 -44.69711 -0.1301506
Std. Errors:
(Intercept) bill_length_mm bill_depth_mm sexmale
Chinstrap 3.985819 9.369182 23.16685 23.33386
Gentoo 4.071031 9.413971 23.17697 43.25185
Residual Deviance: 0.9809986
AIC: 16.981
Whichever way you slice it, bill length, bill depth, and sex remain; flipper length and body mass have been removed.
For the predictions below, use this model.
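Extra: if you want reassurance that removing flipper length and body mass did no significant damage, nnet supplies an anova method that compares nested multinom fits with a likelihood ratio test. A sketch:
anova(penguins.2, penguins.1)
A large P-value here would say that the smaller model fits about as well as the bigger one, agreeing with the AIC-based decision that step made.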
- Obtain predicted probabilities of each species from your best model for all nine combinations of: bill length 39.6, 40.0, and 40.4, bill depth 15.8, 16.0, and 16.2, sex female (only). Display those predicted probabilities in a way that allows for easy comparison of the three species probabilities for given values of the other variables.
Three stages, as we have seen before (a point each):
- make a dataframe (perhaps called new) of all the combinations of explanatory variables
- do the predictions
- rearrange the predictions so that they are easy to compare.
For the first step: datagrid does all possible combinations, so that is the way to go. When there are a lot of things that you want combinations of, it can be clearer to define them first. The best model is the one output by step, which I called penguins.2:
lengths <- c(39.6, 40.0, 40.4)
depths <- c(15.8, 16.0, 16.2)
sexes <- "female"
new <- datagrid(model = penguins.2, bill_length_mm = lengths, bill_depth_mm = depths, sex = sexes)
new
There are \(3 \times 3 \times 1 = 3^2 = 9\) combinations of values. My habit, when I define the values first like this, is to use a plural name for the values, so that when I construct new, it’s “singular = plural” all the way through.
If you prefer to put the actual values inside the datagrid, you can do that, but then you will need to split the line of code so that it is easier to see all the values, for example a new line for each variable, like this:
new0 <- datagrid(model = penguins.2,
bill_length_mm = c(39.6, 40.0, 40.4),
bill_depth_mm = c(15.8, 16.0, 16.2),
sex = "female")
new0
Doing it this way, with each variable on a new line, makes it easier to read.
Second, the predictions. Save the explanatory variable names, the column group that says which species the predicted probability is for, and the column estimate of the predicted probabilities themselves:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex)
Third, pivot-wider the estimated probabilities into columns named for the species. Mine came out wider than my display, so I additionally rounded off the probabilities to make the columns narrower:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex) %>%
pivot_wider(names_from = group, values_from = estimate) %>%
mutate(across(Adelie:Gentoo, \(x) round(x, 3))) # for each column from Adelie to Gentoo, round it to 3 decimals
A second way to do this is to round off the predicted probabilities before you pivot-wider, which is easier to do but harder to see that you can do it this way. Run the code as far as the select to figure out how you need to do it: there’s a column called estimate with all the predicted probabilities in it, so round this column to 3 decimals before you proceed with the pivot_wider:
cbind(predictions(model = penguins.2, newdata = new)) %>%
select(group, estimate, bill_length_mm:sex) %>%
mutate(estimate = round(estimate, 3)) %>%
pivot_wider(names_from = group, values_from = estimate)
Extra: I had to play around a bit with the numbers I wanted you to get predictions for. The problem is that the species are actually very distinct (as you will see with your graph), and so if you are not careful with your values, you will get a lot of predicted probabilities that are close to zero and one. I ended up restricting things to just females for this question because (as it turns out) males have a similar pattern but are just bigger overall, and the pattern was easier to see with just one sex.
- Using your predictions, what combinations of values for bill length and bill depth distinguish the three species?
Find the largest predicted probabilities for each species and say something about whether they go with small or large values on each variable:
- Adelie have small bill length and large bill depth
- Chinstrap have large bill length and medium or large bill depth
- Gentoo have small bill depth (but can have any bill length).
Another way you can go is to look at the combinations of large and small values on each explanatory variable:
- large bill length and large bill depth is probably Chinstrap
- large bill length and small bill depth is probably Gentoo
- small bill length and large bill depth is (almost certainly) Adelie
- small bill length and small bill depth is probably Gentoo (there actually aren’t really any of these, as we’ll see from a graph shortly).
- Make a plot that includes bill length, bill depth, species, and sex. Does your plot support your conclusions from the previous question?
This plot is one you would make from the data, not of the predictions from the model (which is rather hard to interpret).
This has two quantitative variables (the first two) and two categorical (the last two). If you only had one categorical variable, this would be a scatterplot with the levels of the categorical variable as colour. The procedure we have seen for dealing with “too many” categorical variables is to use facets. My take is that sex is explanatory, so that is better as facets, and hence:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
geom_point() + facet_wrap(~ sex)
I am deliberately not using scales = "free" because I want to be able to compare where the points are on the two facets.
I have to say that I think these graphs are actually the best way towards understanding here. The benefit of fitting the model was to show that these variables are the ones that will help to distinguish the species.
The bill length and bill depth do a great job of distinguishing the penguin species:
- small bill length (and large bill depth) is Adelie
- large bill length and large bill depth is Chinstrap
- large bill length and small bill depth is Gentoo
- this pattern is the same for males and females, but the males are bigger on both measurements than the females. (This last is the reason for having the same scales in both facets, and was also the reason for having you look at just females previously.)
For you, say something sensible about how the graph (for females) is consistent or not with what you found in the previous question.
When I looked at the predicted probabilities, I got the same thing as this except that there, I predicted Gentoo for small bill depth and all bill lengths. (What actually happened is that there really weren’t any small bill lengths and small bill depths in the data, but the prediction had to come up with something, and the values we were predicting for were slightly closer to Gentoo than Adelie.)
Extra: I actually drew the graph first, and then tried to reverse-engineer a sensible prediction question for you.
plot_predictions doesn’t work so easily here because there are two explanatory variables in the model. To show both, because you want to keep colour for the species, you put the other ones on the end to make facets. This chooses five representative values for a quantitative explanatory variable and gives you a facet for each. I also added sex; plot_predictions will use facet_grid rather than facet_wrap when there are two explanatory variables after the thing that is colour. The facets for the first one are in columns and for the second one are in rows. The five different values for bill_depth_mm are hidden in the facet labels, and you can’t really see them, but take it that the plot is using something like a five-number summary:
plot_predictions(penguins.2, condition = c("bill_length_mm",
"group", "sex",
"bill_depth_mm"))Warning: Removed 398 rows containing missing values or values outside the scale range
(`geom_ribbon()`).
The species are actually very distinguishable (as you saw from the plot you made for this part), so most of the predicted probabilities are close to zero or one.⁸ Reading these graphs down the left side (for females):
- when bill depth is small, it is a Gentoo
- when bill depth is small-to-moderate:
- when bill length is small, it is an Adelie
- when bill length is large, it is a Gentoo
- when bill depth is large:
- when bill length is small, it is an Adelie
- when bill length is large, it is a Chinstrap.
The actual definitions of “small” and “large” for bill depth depend on the value of bill length.
This is more or less the same as our graph (the one I had you make), and the pattern is almost exactly the same for males and females, except that for females, the third facet down is the same as the last two, and for males, it is the same as the second one.⁹
Choice-box
A psychology experiment began by showing a video in which four German children demonstrated how to use a device called a “choice-box”, which consisted of three pipes. Three of the children in the video used pipe #1, demonstrating how to throw a ball into the pipe and receive a toy from a dispenser. The other child in the video used pipe #2, also throwing a ball into the pipe and receiving a toy from the dispenser. Pipe #3 was never used on the video.
The pipes on the choice-box were actually different colours, and different versions of the video were used in which the identity of pipes #1, #2, and #3 were varied at random, and the order of children using pipes #1 and #2 on the video were also varied at random: sometimes the three children demonstrating the same pipe appeared first, and sometimes the one child demonstrating the other pipe appeared first.
The 629 subjects of the experiment, who were other children of various ages, were each given one ball to use in the choice-box. The experimenter noted which pipe each subject threw the ball into, and how it related to the pipes used in the video that subject had watched. These are in the column y:
- majority: the subject threw their ball into the pipe demonstrated by three children on their video (what I called pipe #1).
- minority: the subject threw their ball into the pipe demonstrated by only one child on their video (what I called pipe #2).
- unchosen: the subject threw their ball into the pipe demonstrated by none of the children on their video (what I called pipe #3).
I should probably point out that these subjects got a toy from the dispenser as well.
The aim of the experiment was to see whether the subjects were influenced by what happened on the video they saw: for example, was a subject more likely to choose the pipe demonstrated three times on their video? The experimenters also recorded the gender, age, and culture of each subject (coded as C1 through C8), along with whether the video showed three children using pipe #1 first, or one child using pipe #2 first. Did these other variables have an effect on which pipe a subject chose? This kind of experiment might shed some light on how children are influenced by what they see and what changes that influence.
The data are in http://ritsokiguess.site/datafiles/Boxes.csv.
- Read in and display (some of) the data.
As ever:
my_url <- "http://ritsokiguess.site/datafiles/Boxes.csv"
boxes <- read_csv(my_url)
Rows: 629 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): y, gender, culture, majority_first
dbl (1): age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
boxes
There are 629 rows, one per subject (which will mean later that we don’t have to deal with that weights thing).
Extra: what are the culture values?
boxes %>% count(culture)
I had to work this out to finish writing the question. There are more children from culture C3 than any other.
- What assumption is made about the response categories in order to use multinom from package nnet?
The response categories are assumed to be unordered (or, labels for different categories in no particular order).
You can decide for yourself whether you think the categories of y are actually ordered or unordered. For example, you might think that a sensible ordering would be majority, minority, unchosen, in descending order of how much influence the video had over the subject, but we are going to treat them as unordered here. (I carefully worded the question so that you didn’t have to choose, but rather only had to explain the implication of the choice that was made.)
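Extra: for the record, if you had decided the categories were ordered, the setup would be to make a factor with the levels in your chosen order, and then use polr as in the carrots question. A sketch (the names y_ord and boxes_ord are my inventions, and we are not doing this here):
boxes %>%
  mutate(y_ord = factor(y, levels = c("majority", "minority", "unchosen"))) -> boxes_ord
The modelling would then use polr with y_ord as the response instead of multinom.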
- Fit an appropriate model for predicting the (treated as unordered) category of y from the other variables. Include a squared term in age. You don’t need to display any results.
That means something like this:
boxes.1 <- multinom(y ~ gender + age + I(age^2) + culture + majority_first,
data = boxes)
# weights: 39 (24 variable)
initial value 691.027130
iter 10 value 619.662201
iter 20 value 615.538678
final value 615.520419
converged
- To find out what, if anything, you can remove from your model, use step. The input to step is a model (here, the one you fitted in the previous part). The output from step is another model, the one obtained by removing everything that can be removed. Save this model. Running step displays some additional output, showing you what it is doing. (You might find that there is a lot of additional output; that was fine to hand in on this assignment.)
Aside: In most cases, drop1 works just fine, and you can remove non-significant things one at a time. This, however, is not one of those cases:
drop1(boxes.1, test = "Chisq")
trying - gender
Error in if (trace) {: argument is not interpretable as logical
This is, as far as I can tell, an actual bug in multinom, or in the way multinom interfaces with drop1, one that no-one has gotten around to fixing.
End of aside.
So what you can do is to run step and let it do the removal of unimportant explanatory variables for you:
boxes.2 <- step(boxes.1)
Start: AIC=1279.04
y ~ gender + age + I(age^2) + culture + majority_first
trying - gender
# weights: 36 (22 variable)
initial value 691.027130
iter 10 value 623.801093
iter 20 value 618.975316
final value 618.958678
converged
trying - age
# weights: 36 (22 variable)
initial value 691.027130
iter 10 value 620.650429
iter 20 value 617.395922
final value 617.381771
converged
trying - I(age^2)
# weights: 36 (22 variable)
initial value 691.027130
iter 10 value 619.003503
iter 20 value 617.499856
final value 617.497924
converged
trying - culture
# weights: 18 (10 variable)
initial value 691.027130
iter 10 value 624.110284
final value 623.287626
converged
trying - majority_first
# weights: 36 (22 variable)
initial value 691.027130
iter 10 value 649.159141
iter 20 value 646.799762
final value 646.785972
converged
Df AIC
- culture 10 1266.575
- age 22 1278.764
- I(age^2) 22 1278.996
<none> 24 1279.041
- gender 22 1281.917
- majority_first 22 1337.572
# weights: 18 (10 variable)
initial value 691.027130
iter 10 value 624.110284
final value 623.287626
converged
Step: AIC=1266.58
y ~ gender + age + I(age^2) + majority_first
trying - gender
# weights: 15 (8 variable)
initial value 691.027130
iter 10 value 626.524047
final value 626.513205
converged
trying - age
# weights: 15 (8 variable)
initial value 691.027130
iter 10 value 627.242036
final value 625.727509
converged
trying - I(age^2)
# weights: 15 (8 variable)
initial value 691.027130
iter 10 value 626.363996
final value 626.092979
converged
trying - majority_first
# weights: 15 (8 variable)
initial value 691.027130
iter 10 value 654.784947
final value 654.773132
converged
Df AIC
<none> 10 1266.575
- age 8 1267.455
- I(age^2) 8 1268.186
- gender 8 1269.026
- majority_first 8 1325.546
At the bottom of the extra output is a table showing what is in the final model from step: age, age-squared, gender, and majority_first. The top line <none> says that it is best, according to AIC (which is what step uses), to take out nothing else.
Extra: the reason there is a lot of extra output is that running multinom even once produces a progress report with lines “initial”, “iter” (sometimes a few of those), and “final”. The word “converged” below “final” shows that it worked properly (which it looked as if it did every time here).
The main reason for the long extra output, though, is that multinom had to be run a bunch of times:
- starting from what I called boxes.1, take out each of the explanatory variables from the model one at a time and fit the model without each one, saving the AIC from each fit
- make a table showing the resulting AIC for the removal of each explanatory variable, including removing nothing
- remove the best thing to remove (here culture)
- repeat the first two steps for the current best model
- the best thing to remove is now nothing, so we are done, and the table shows what is still in our model.
- For your best model, create a dataframe for predicting the probability that a child will choose the majority, minority, or unchosen pipe, for ages 5 through 13. What values have been used for the other explanatory variables?
This means using datagrid to supply some representative values for the other variables, and the ages we want. Use the model that came out of step, which I called boxes.2:
new <- datagrid(model = boxes.2, age = 5:13)
new
As per tradition, my resulting dataframe is called new.
Two points were for that much. The other point was for saying what values of the other variables found their way into your dataframe and why. These are gender boy and majority_first being no (and y being majority, but this is not used in the predictions). These are categorical, and these categories appear here because they are the most frequent categories in the data.
Extra: you don’t need to demonstrate that your assertion is correct, but if you wanted to:
boxes %>% count(gender)
boxes %>% count(majority_first)
It is a close call, but there are a few more boys than girls in the dataset, and slightly more of the time, what I called pipe #2 was demonstrated first in the video that a subject saw.¹⁰
- Calculate and display your predictions side by side with the corresponding explanatory variable values. Arrange your predictions in a way that makes them easier to compare.
predictions. Select the output columns you want to keep, and then pivot the group (predicted response category) wider:
cbind(predictions(boxes.2, newdata = new)) %>%
select(group, age, gender, majority_first, estimate) %>%
pivot_wider(names_from = group, values_from = estimate)
Two points for the first two lines, and a third point for doing some sensible rearrangement of the predictions to make them easier to read.
Extra: the effect of age is that the youngest and oldest children are most likely to follow the majority (what I called pipe #1); older children are less likely to follow the minority (vs younger children) and more likely to choose the pipe that was never chosen on their video. Come up with your own explanation of why you think that is.
(The age effect for girls is the same as for boys, at least according to our model, because there is no interaction between gender and age in the model. I tried adding one, but it was not significant, so the effect of age seems to be the same for boys and for girls.)
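Extra: a sketch of how you might check that interaction yourself: add it with update, then compare the two models with the anova method for multinom fits (boxes.3 is a name I made up):
boxes.3 <- update(boxes.2, . ~ . + gender:age)
anova(boxes.2, boxes.3)
A large P-value for the comparison says that the interaction can be dispensed with, which is what I found.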
- Plot the predictions as they depend on age. Hint: use the simplified procedure shown in lecture (which should also be in the slides).
Probably the easiest way to think about this is to start with what comes out of predictions before you tidy it up:
cbind(predictions(boxes.2, newdata = new))
The category of y being predicted for is called group (then, as you recall, we took the estimate for each group and pivoted them wider to make them easier to look at).
Then think back to what we did when plotting the predictions for the coal miners: we put the group on the end of the condition. In this case, that means having age first and then group:
plot_predictions(model = boxes.2, condition = c("age", "group"))It is perfectly good (though more, unnecessary, work for you) to grapple with what is in the lecture notes:
plot_predictions(boxes.2, condition = c("age"), type = "probs", draw = FALSE) %>%
ggplot(aes(x = age, y = estimate, colour = group)) +
geom_line()
Extra: You might be wondering what values are being used for the other variables gender and majority_first in the graph. I think it is the same as for datagrid: the most common category. I haven’t checked this; one way you might investigate is to compare the predictions you got in the previous part with your graph here.
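Another way to investigate, which I also haven’t verified: plot_predictions with draw = FALSE returns the dataframe behind the plot instead of drawing it, and if that dataframe carries the other covariate columns along, distinct will show you the values that were used:
plot_predictions(model = boxes.2, condition = c("age", "group"), draw = FALSE) %>%
  distinct(gender, majority_first)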
The way plot_predictions works is to put the first thing in condition on the \(x\)-axis, to use coloured curves for the second thing, and to use facets for the third thing (if there is one). This suggests that you might add gender to the plot like this:
plot_predictions(model = boxes.2, condition = c("age", "group", "gender"))It seems to make sense to keep the colours to represent the outcome category.
What we see here is that regardless of age, girls are more likely than boys to go with the minority (pipe #2) on their video, and less likely to go with the majority (pipe #1).
It also looks as if the plot I had you make was indeed based on boys.
Footnotes
1. predictions is probably using more accurate numbers than I took from the summary output.↩︎
2. It’s actually a slope in terms of log-odds.↩︎
3. In general, for a generalized linear model, it gives predictions on the scale on which the model was fitted, which is log-odds for a logistic regression: the scale on which the original predictions are done.↩︎
4. Adding on a log scale is the same thing as multiplying on the original scale. This is why you can multiply numbers by adding their logs.↩︎
5. That filter has to do with time series, where “filter” has its own meaning.↩︎
6. If you are lucky.↩︎
7. For example, add up the males and females, or add up the numbers of each species.↩︎
8. This is why I had to work so hard to have you find some predictions that were not close to 0 or 1.↩︎
9. Look carefully to see which of those two close-together traces, because of the scale, is green and which one is blue.↩︎
10. The usual consent form for a study like this, probably signed by each subject child’s parents in this case, allows the subject to withdraw from the study at any time, resulting in some missing data if this happens. My guess is that the experimenters started with equal numbers of boys and girls, and equal numbers of subjects seeing pipe #1 first in their video, but a few of the subject children didn’t complete the study for whatever reason.↩︎