Worksheet 9

Published

March 13, 2026

Packages

library(tidyverse)
library(lme4)
library(MASS, exclude = "select")

You have intelligent rats

(We analyzed these data before; this time, we are going to run a mixed model.)

Each of 12 students trained rats to run a maze, and the number of successful runs (out of 50 attempts) was recorded on each of 5 days. Some of the students were told that the rats they were training were “bright” (intelligent) and some of them were told that their rats were “dull” (not intelligent). In actual fact, all the rats were from the same source, and none of them were any more intelligent than the others. Did it make a difference whether the students were told that their rats were intelligent on the number of mazes the rats successfully ran, and, if so, was the effect consistent over time? The same group of rats were trained by the same student throughout this experiment, so it makes sense to treat the data as repeated measures.

The data are in http://ritsokiguess.site/datafiles/intelligent-rats.csv. The columns of interest to us are:

  • Student: a numerical identifier for each student
  • Treatment: what the student was told about their rats
  • Day1 through Day5: the number of successful runs on each day.

There are some other variables that will not concern us here.

  1. Read in and display (some of) the data.

Nothing unexpected here:

my_url <- "http://ritsokiguess.site/datafiles/intelligent-rats.csv"
maze <- read_csv(my_url)
Rows: 12 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Treatment
dbl (11): Student, PriorExp, Block, Day1, Day2, Day3, Day4, Day5, Relax, Suc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
maze
  1. Draw a suitable interaction plot (for treatment and time). How does this clarify your conclusions of the previous part, meaning on last week’s worksheet? (Hint: you’ll need longer data. If you want to be consistent with me, use Day for the names of the days, and runs for the numbers of runs.) (This is a repeat of the last worksheet, but you will need longer data for what follows.)

The hint says that the first thing we need to do is to make “longer” data, with each observation on one row:1

maze %>% 
  pivot_longer(starts_with("Day"), names_to = "Day", values_to = "runs") -> maze_long
maze_long

Then work out the mean number of runs for each combination of Treatment and Day:

maze_long %>% 
  group_by(Treatment, Day) %>% 
  summarize(mean_runs = mean(runs)) -> means 
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by Treatment and Day.
ℹ Output is grouped by Treatment.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(Treatment, Day))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
means

And then make the interaction plot, bearing in mind that you need a colour and a group and they need to be the same:

ggplot(means, aes(x = Day, y = mean_runs, colour = Treatment, group = Treatment)) + 
  geom_point() + geom_line()

(You don’t need to display the intervening dataframes, though you probably should for yourself while you’re working on this.)

When the students were told that their rats were “bright”, the number of successful runs on average increased over time. But for the students told that their rats were “dull”, the mean number of successful runs increased up to day 3, peaked there, and decreased afterwards. You might also say that the traces over time are not parallel, and talk about how they are not.

This difference in pattern over time was the reason for the significant interaction.

Extra: I also drew a spaghetti plot, which uses the same long data that we just made, but has a colour and group that are different. The lines on this one need to join observations that come from the same student, so that’s what goes in group:

ggplot(maze_long, aes(x = Day, y = runs, colour = Treatment, group = Student)) + 
  geom_point() + geom_line()

The pattern is not so clear here (which is why I had you draw the interaction plot instead), but if you look carefully, you can that several of the students with the supposedly “dull” rats had the number of successful runs peak at day 3, while the students with “bright” rats saw the number of runs on a more or less upward trajectory. There is some variability here, but the different patterns over time may be consistent enough to make the interaction significant (in the mixed model that we are about to fit).

The red traces are mostly above the blue ones (except perhaps at day 3), which is probably the reason there is a significant treatment main effect, but we would do better to look at simple effects over time to further understand the significant interaction (the treatment effect will probably not be consistent over time, as a consequence of the significant interaction).

  1. Do a mixed model analysis (hint: use the long data that you made for your interaction plot.)

Let’s remind ourselves of how the longer data looked:

maze_long

Make sure you have lme4 installed and loaded.

The response variable is runs. That depends on Treatment and time (Day) and their interaction. In addition, you see that the first five measurements in the dataframe are from the rats of the same student (student 1), and because they are trained by the same student, we expect the numbers of runs from those rats to be correlated. This is incorporated in the analysis by adding a “random effect” for Student. In the jargon of mixed models, this is more precisely called a “random intercept”: it says that the mean number of runs is moved up or down for the particular student that trained those rats, the same for each day:

maze_long.1 <- lmer(runs ~ Treatment * Day + (1|Student), data = maze_long)

Then we run drop1 to see if we can get rid of any of the fixed (non-random) effects. The test is Chisq because this model is based on maximum likelihood not least squares:

drop1(maze_long.1, test = "Chisq")

The interaction between Treatment and Day is significant, so that is our finding: this is where we stop.

Extras:

  • The 1 in the code 1|Student is the R notation for “intercept”, so the code 1|Student means “a different intercept for each Student”.
  • Random effects, such as the random intercepts for students here, never get tested. We are allowing for the possibility that the students are different from each other, which is something we can estimate because we have repeated observations on the same students. This is not of immediate interest to us, because the students are the ones the original researchers could get: in principle, a random sample of all possible students, and so we don’t care especially much about these students. You might be reminded of “blocks” in a randomized blocks analysis: we don’t care about the blocks themselves, but we expect them to be different from each other and thus to affect the outcome in addition to whatever treatments we are interested in.
  • You could also have a random effect of Student that increases over time, a so-called “random slope”. To do that, you pull out the number from Day and store it in, say, Day_number, and then write your random effect as Day_number|Student, which will introduce a random effect for each student that is a linear function of time.
  • Having repeated observations on the same subjects is new. In C32, we didn’t have that, because each subject only gave us one observation. For example, in a two-sample experiment, the subjects are divided at random and each gets only one of treatment or control. The exception is a matched-pairs experiment, where each subject has a “before” and an “after” (or something like that), and we rig up one observation for each subject by taking the difference between before and after.
  1. Compare your findings from the mixed model to the ones from the Manova analysis you did earlier (on last week’s worksheet).

The findings are exactly the same, albeit with a smaller P-value: there is a significant interaction between Treatment and Day, so the effect of treatment is different for the different days.

If you want to follow this up by doing simple effects of treatment for each day, the analysis is exactly the same as the one you did before, because when you look at each day separately, there is no repeated measures.2

Diabetes

According to the Mayo Clinic,

Diabetes mellitus refers to a group of diseases that affect how the body uses blood sugar (glucose). Glucose is an important source of energy for the cells that make up the muscles and tissues. It’s also the brain’s main source of fuel. The main cause of diabetes varies by type. But no matter what type of diabetes you have, it can lead to excess sugar in the blood. Too much sugar in the blood can lead to serious health problems.

The data in http://ritsokiguess.site/datafiles/diabetes1.csv are from 145 non-obese adult patients classified into three groups (types of diabetes): “normal”, “overt”, and “chemical”. For each patient, five other variables were also recorded:

  • rw: relative weight, the ratio of actual weight to ideal weight for the person’s height.
  • fpg: fasting plasma glucose
  • glucose: area under plasma glucose curve after 3-hour glucose tolerance test
  • insulin: area under plasma insulin curve after 3-hour glucose tolerance test
  • sspg: steady-state plasma glucose

These variables are recorded here as \(z\)-scores (they were originally measured on vastly different scales).

Our aim is to investigate any association between the five measured variables and the diabetes type (in group).

  1. Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/diabetes1.csv"
diabetes <- read_csv(my_url)
Rows: 145 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (5): rw, fpg, glucose, insulin, sspg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diabetes

145 rows, the variables promised, and values for those variables that do indeed look like \(z\)-scores.

Extra: discriminant analysis works best if all the quantitative variables are on about the same scale, so that the sizes of the “coefficients of linear discriminants” are roughly comparable (since we are going to look at them to decide what LD1, LD2, etc., contain). For the original data, this was very much not the case:

library(rrcov)
Loading required package: robustbase
Scalable Robust Estimators with High Breakdown Point (version 1.7-6)

Attaching package: 'rrcov'
The following objects are masked from 'package:lme4':

    getData, isSingular
data("diabetes")
diabetes

You can see that a one-unit change in rw is a big change relative to the kinds of values rw takes, but a one-unit change in glucose is a very small one. So I decided to re-express the quantitative variables as \(z\)-scores. The function is called scale, but if you feed it a column, it will give you back a matrix, for reasons that go back into the dusty early days of R. So I wrote a little function that takes as input a column, and returns the standardized version of that column to two decimal places. First I run scale, then I turn it from a matrix back into a vector, and then I round the vector to 2 decimals. The bottom line is a test:

my_scale <- function(x) {
  y <- scale(x)
  z <- as.vector(y)
  round(z, 2)
}
zz <- c(0, 2, 4, 6)
my_scale(zz)
[1] -1.16 -0.39  0.39  1.16

This seems to have worked. My zz has mean 3, and the values above the mean are as far above 3 as the ones below are below, so I should get two pairs of \(z\)-scores differing only in sign.

The idea is that the thing I’m doing on a bunch of columns is kind of messy, so I write a function to abstract away the messiness, and then my across (below) is much simpler.

Now that I have that set up, I run it on all the columns except for the categorical group. The mutate line below reads, in English, as “take all the columns except for group, and replace them by the standardized versions of themselves, to 2 decimals”:

diabetes %>% 
  mutate(across(-group, \(x) my_scale(x))) -> diabetes
diabetes

This is the data I saved for you.

  1. Using manova, demonstrate that the group has some kind of effect on the other variables.

Create a response matrix first, out of the quantitative columns. You have two options here, the base-R way that you will probably think of first:

response <- with(diabetes, cbind(rw, fpg, glucose, insulin))

Or the tidyverse-till-the-end way, where everything stays as a dataframe until the end:

diabetes %>% select(-group) %>% 
  as.matrix() -> response1

I’m not displaying either response matrix, because they have 145 rows (though I did for myself earlier to make sure they looked right).

Then run the MANOVA:

diabetes.1 <- manova(response ~ group, data = diabetes)
summary(diabetes.1)
           Df Pillai approx F num Df den Df    Pr(>F)    
group       2 1.2113   53.756      8    280 < 2.2e-16 ***
Residuals 142                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value is very small, so one or more of the groups has some effect on one or more of the quantitative variables. As you recall, at this point, we have no idea what kind of effect, or on what, but we do know that there is something to find. If this were ANOVA rather than MANOVA, we would at this point run Tukey to see what kind of differences there were.

  1. Run a discriminant analysis and display the results.

This part is not much work, having made sure you have MASS loaded:

diabetes.2 <- lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)
diabetes.2
Call:
lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)

Prior probabilities of groups:
 chemical    normal     overt 
0.2482759 0.5241379 0.2275862 

Group means:
                  rw        fpg    glucose    insulin       sspg
chemical  0.60694444 -0.3550000 -0.1569444  0.8419444  0.2336111
normal   -0.31000000 -0.4815789 -0.6110526 -0.1111842 -0.6622368
overt     0.05060606  1.4957576  1.5787879 -0.6627273  1.2700000

Coefficients of linear discriminants:
                LD1        LD2
rw       0.17691142  0.4883370
fpg     -2.15320564 -2.3530795
glucose  3.98629010  2.2550693
insulin -0.01269006  0.7451996
sspg     0.44658743 -0.1192412

Proportion of trace:
  LD1   LD2 
0.881 0.119 

In a discriminant analysis, the categorical variable (or combination of the categorical variables) goes on the left, and the quantitative variables go on the right, so that we are “predicting category”, even if logically the category is really influencing the quantitative variables rather than the other way around.

Extra 1: this worksheet is likely to get long, so I won’t ask it here, but I often ask whether the number of linear discriminants you got is what you were expecting. Here, we had 3 groups and 5 variables, so we should expect the smaller of \(3-1\) and 5, that is 2, LDs, and that’s what we got.

Extra 2: depending specifically on your R installation, you might get output with negative and positive signs switched around compared to mine. This may make your graphs (later) come out upside down or mirror-image compared to mine, and may have other effects, but as long as you appear to have done the right thing, you are good (that is, if the grader on an assignment cannot find any errors in your code, it is no problem if your output differs from mine in these kinds of ways.)

  1. Comment briefly on the relative importance of the linear discriminants.

Look at the “proportion of trace” at the bottom of the output. The value for LD1, 0.881, is much larger than the value for LD2, 0.119, so the first linear discriminant is much more important than the second, and does most, but not all, of the work in separating the three groups.

Your choice of adjective; LD2 is not quite worthless, but we will see on a graph later that most of the separation is happening along the LD1 dimension.

  1. Which two of the original quantitative variables play the largest role in LD1? What kind of values on those variables would make the LD1 score large (very positive)?

This is, for me, fpg negatively and glucose positively; thus, the LD1 score will be largest if glucose is large and/or fpg is small. (I am happy if you say “and” here, though things are actually a bit more complicated than that; see the Extra below.)

Extra: We will see later that this is characteristic of an individual with overt diabetes. Having said that, such individuals tend to have large glucose and large fpg. The rationalization of this is that the coefficient on LD1 for glucose is almost twice as big in size as the one for fpg, so that a large value of glucose “wins” over a large value of fpg. I drew some graphs for myself at the end to understand this better.

You might have the signs switched around, so the answer here would then be “small glucose and large fpg”.

  1. Obtain and save a dataframe containing the predicted group memberships, posterior probabilities, and discriminant scores for each individual, along with the original data. Display (some of) your dataframe.

This is obtained with the old-fashioned predict. predict only produces the predictions, so we need to glue this onto the original data:

p <- predict(diabetes.2)
d <- cbind(diabetes, p) 
d

Most of the columns I see here came from the original data, but class is the predicted group (from the discriminant analysis, based only on the values of the five quantitative variables). If you scroll right, you’ll see the posterior probabilities of each individual being in each group, and at the end the LD scores (on LD1 and LD2).

  1. Obtain a table counting the number of individuals who actually had each type of diabetes, cross-classified by the type of diabetes they were predicted to have. Does the classification appear to be good or bad? Explain briefly.

This is the standard way to do it:

with(d, table(obs = group, pred = class))
          pred
obs        chemical normal overt
  chemical       30      6     0
  normal          3     73     0
  overt           5      1    27

You also have a tidyverse option, with count:

d %>% count(group, class)

but this, I think, is best made to look like the previous output:

d %>% count(group, class) %>% 
  pivot_wider(names_from = class, values_from = n, values_fill = 0)

The pivoting-wider introduces some cells that didn’t appear in the original count output (which had only seven rows); the values_fill replaces those missings with the zero that they actually are. You won’t always need this; try it without first, and if you get some missings that should be zeros, put it in.

However you get the table, look at the values off the top-left to bottom-right diagonal; these are the ones that the classification got wrong. There are not many of these here, only \(6+3+5+1 = 15\) out of 145. So the classification appears to be good (which implies that the three types of diabetes can be distinguished well by the five quantitative variables, in particular by glucose and fpg that were the most important part of LD1).

If you have the table that comes directly out of count(group, class), you’ll need to say something about how you know the classification was a good one, for example words like “the rows on which the actual group and the predicted group (in class) are different are the ones for which the classification was wrong. The frequencies on these rows are small, so not many of the individuals were wrongly classified”.

  1. Find an individual that was misclassified (it doesn’t matter which one). For your chosen individual, was the misclassification clear-cut or a close thing? Explain briefly.

The true type of diabetes is in group and the predicted type is in class, so you can first find the individuals for which group and class are different:

d %>% filter(group != class)

Then, whichever individual you pick, scroll across and find their posterior probabilities of being in the true and predicted types. For example, the individual in row 26 was actually normal but was predicted to be chemical. This individual has a posterior probability 0.82 of being chemical, only 0.17 of being normal, and almost 0 of being overt. So this decision was actually quite clearly wrong.

Go through this procedure with your chosen individual. The answer you get doesn’t matter as long as it follows logically, and will depend on your chosen individual. The one in row 59, for example, is a close call (0.47 and 0.53), and the one in the row labelled 66 (on mine) is an even closer call. (The row numbers come from the original dataframe d before the filter was run on it.)

  1. Make a plot of LD1 and LD2 scores for each individual, distinguished by the group they belong to. There are too many points on this plot to label individually.

Just this:

ggplot(d, aes(x = x.LD1, y = x.LD2, colour = group)) + geom_point()

If you try to label the points (like I did on the peanuts example in class), the plot will be mostly an effort to find places to put labels, so don’t do that.

Look in your dataframe of data plus predictions (the one I called d) to see what the discriminant scores are called. It depends on precisely how you made your dataframe. On mine, they are called x.LD1 and x.LD2, but on yours they might be called LD1 and LD2. Take a look.

Extra: to understand why that happened for me, let’s look at the “structure” of my p:

str(p)
List of 3
 $ class    : Factor w/ 3 levels "chemical","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:145, 1:3] 0.007919 0.000234 0.000362 0.0546 0.019599 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:145] "1" "2" "3" "4" ...
  .. ..$ : chr [1:3] "chemical" "normal" "overt"
 $ x        : num [1:145, 1:2] -1.7 -2.84 -2.63 -1.51 -1.86 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:145] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "LD1" "LD2"

This is actually a list, with things inside other things. For example, the posterior probabilities of each group are all in one object called posterior, with three things called chemical, normal, and overt inside that. The cbind I did earlier squashes this list down into a dataframe, and the nested structure is lost: chemical inside posterior becomes a column called posterior.chemical, and LD1 within x becomes x.LD1. Check what I called d to see that this is exactly what happened.

  1. Which group is on the right on your plot? What does that say about this group’s values on the original quantitative variables?

On my graph, the overt group is on the right. These have high LD1 scores, and as we said earlier, these individuals have large glucose and/or small fpg. Your graph might have come out as the mirror image, in which case you might have the normal group on the right, and they have small glucose and (possibly) large fpg.

Extra: these questions spiral rapidly out of hand if I ask you to do everything. The one thing I didn’t ask you to do was to make a biplot, which is actually easier to make than the LD scores plot, once you have everything set up:

library(ggbiplot)
Loading required package: plyr
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: 'plyr'
The following objects are masked from 'package:dplyr':

    arrange, count, desc, mutate, rename, summarise, summarize
The following object is masked from 'package:purrr':

    compact
Loading required package: scales

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
Loading required package: grid
library(conflicted)
conflicts_prefer(dplyr::mutate)
[conflicted] Will prefer dplyr::mutate over any other package.
conflicts_prefer(dplyr::count)
[conflicted] Will prefer dplyr::count over any other package.
conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.

The ggbiplot package uses a package called plyr that has a lot of functions with the same names as ones in dplyr (the database-type functions in the tidyverse). It is a good idea to load the conflicted package here and explicitly choose the dplyr functions when conflicted offers you a choice, using conflicts_prefer. These are the ones I needed; you might need others.

ggbiplot(diabetes.2, group = diabetes$group)

The value this has is that we see the original variables as well, so we get confirmation that the overt group is high on glucose and (apparently) low on fpg, and the other three variables don’t do much to distinguish the groups.

Only two of our original quantitative variables seem to have any difference between groups, so we are in the fortunate position of being able to plot those two against each other, with groups labelled by colour:

ggplot(diabetes, aes(x = glucose, y = fpg, colour = group)) + geom_point()

This is where I saw that glucose and fpg actually have a high positive correlation, so that a high value on one almost always goes with a high value on the other. This seems to contradict what the biplot and question 9 were saying, where a high LD1 score seems to go with high glucose and low fpg. I saw this last night, when I was very tired, and figured I’d better come back to it this morning with coffee in me and explain it properly for you. So I went back today and qualified my answer to question 9 (the “and/or”, and the Extra there), so that what is actually happening is that the high glucose is “winning” over the high fpg because its coefficient in LD1 is bigger in size.

You can see that glucose by itself does a pretty good job of distinguishing the groups, and you might be wondering why fpg is part of LD1 at all. But if you look carefully, you’ll see that having a value of fpg above zero identifies the overt group and distinguishes it from the chemical group individuals whose glucose is above 0. So it is worth looking at fpg as well as glucose.

In an analysis like this, if you can get it down to individual important quantitative variables, you can see how well they distinguish the groups by drawing boxplots:

ggplot(diabetes, aes(x = group, y = glucose)) + geom_boxplot()

and

ggplot(diabetes, aes(x = group, y = fpg)) + geom_boxplot()

and you can even make a little table that shows which variable does the better job of distinguishing each pair of groups:

Group 1 Group 2 better variable
chemical normal glucose
chemical overt fpg
normal overt both

and you see that both variables matter for distinguishing the groups. In this case, the use of glucose and fpg to distinguish the groups is simple enough that looking at the variables one at a time tells you how the groups are different. Under other circumstances it might have been that a combination of values of the two variables was what was important, and then the scatterplot of the two variables with coloured groups would have been the thing to look at.

This last piece of the analysis seems very simple, scatterplots and boxplots. But it took the discriminant analysis to tell us which variables to look at.

Footnotes

  1. The definition of “tidy data” gets a bit blurry for repeated-measures data. Is “one observation” one value of a number of mazes run, with another column saying which student and day it was (longer), or is it the observations for all five days for the same student (wider)? You could make a case for either answer.↩︎

  2. If you wanted to condition on treatment and look for a time effect for each treatment, that would still be repeated measures, and to be consistent here, you would use a mixed model for that as well.↩︎