STAD29 Assignment 7
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
Packages
library(tidyverse)
library(car)
library(MASS, exclude = "select")
library(lme4)
Improving self-esteem
34 subjects participated in a program designed to improve their self-esteem. Each subject was randomly allocated to a treatment regime: Diet, a special diet; DietEx, the special diet plus an exercise program; or Control, no special diet or exercise. At the end of each of the three months of the program, each subject’s self-esteem was measured using a standard questionnaire, as recorded in the columns se1 through se3. The data are in http://datafiles.ritsokiguess.site/self-esteem.csv. The researchers were interested in which of the programs helped to increase self-esteem, and whether there were any trends over time.
- (1 point) Read in and display (some of) the data.
Somehow I think you guessed this was coming:
my_url <- "http://datafiles.ritsokiguess.site/self-esteem.csv"
self_esteem <- read_csv(my_url)
Rows: 34 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): regime
dbl (7): subject, wl1, wl2, wl3, se1, se2, se3
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
self_esteem
Extra: the columns wl1 through wl3 were actually recorded values of weight loss at the same times as the self-esteem surveys were done. But we ignore that, since two repeated-measures variables is more than we want to handle.
- (2 points) How do you know that we will be doing a repeated measures analysis of these data?
Each subject (in a row of the dataframe) produces three measurements, namely the self-esteem scores at the three times. These are the “repeated measures”.
Extra: the repeated measures analysis will be necessary because the three scores on the same subject may well be correlated:
self_esteem %>% select(starts_with("se")) %>% cor()
          se1       se2       se3
se1 1.0000000 0.6216745 0.5685528
se2 0.6216745 1.0000000 0.1976336
se3 0.5685528 0.1976336 1.0000000
In particular, the self-esteem score at the first time point is positively correlated with the scores at time points 2 and 3 for the same subject.
- (5 points) Run a suitable repeated measures analysis for these data to see whether self-esteem score depends on regime, time, or a combination of both, and display the output. (Interpretation is coming up, so don’t do it here; also, the output will be long but for this assignment you can display all of it.)
There are several steps:
- make your response matrix
- set up the time-dependence stuff
- run your model as an lm first
- feed that into Manova with the time-dependence stuff
- display the summary of this.
Guideline for marker: one point for each of these steps done correctly.
So, to make the response matrix, you can either use cbind or you can go tidyverse-first. I like the second way here because the columns that are going to make the response matrix have similar names and you can use select to get them:
self_esteem %>% select(starts_with("se")) %>%
as.matrix() -> response
head(response)
     se1 se2 se3
[1,] 14 13 15
[2,] 13 14 17
[3,] 17 12 16
[4,] 11 11 12
[5,] 16 15 14
[6,] 17 18 18
If you want to display it for checking, remember that a matrix will display all of itself (34 rows here) unless you stop that from happening. head will display only the top 6 lines.
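Extra: the cbind way mentioned above would be a sketch like this (the name response2 is mine, just for illustration; it should give the same matrix):
# bind the three self-esteem columns into a response matrix by naming them
response2 <- with(self_esteem, cbind(se1, se2, se3))
head(response2)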
Then, the time stuff, which is best set up so that it is exactly the same every time:
times <- colnames(response)
times.df <- data.frame(times = factor(times))
You can, if you wish, verify that times is a vector containing the names of the three response columns, and times.df is a dataframe containing those same three names in a column called times.
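If you want to do that check, it is just a matter of displaying them (output not shown here):
times
times.df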
(Aside: you don’t otherwise need to rearrange the data, because the Manova analysis uses the wider format that we have.)
Next, run this as an lm:
self_esteem.1 <- lm(response ~ regime, data = self_esteem)
There is no particular value in looking at this one; the value of this analysis is the stuff it calculates and saves behind the scenes, to be used by Manova below.
Next, the right Manova. This is where the time stuff comes in:
self_esteem.2 <- Manova(self_esteem.1, idata = times.df, idesign = ~ times)
I am never sure about a good numbering for these; they are really two parts of the same model, so maybe it makes more sense to number them 1a and 1b instead. Your call.
Finally, the summary (of the second one):
summary(self_esteem.2)
Type II Repeated Measures MANOVA Tests:
------------------------------------------
Term: (Intercept)
Response transformation matrix:
(Intercept)
se1 1
se2 1
se3 1
Sum of squares and products for the hypothesis:
(Intercept)
(Intercept) 68670.12
Multivariate Tests: (Intercept)
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.98796 2543.895 1 31 < 2.22e-16 ***
Wilks 1 0.01204 2543.895 1 31 < 2.22e-16 ***
Hotelling-Lawley 1 82.06113 2543.895 1 31 < 2.22e-16 ***
Roy 1 82.06113 2543.895 1 31 < 2.22e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------
Term: regime
Response transformation matrix:
(Intercept)
se1 1
se2 1
se3 1
Sum of squares and products for the hypothesis:
(Intercept)
(Intercept) 21.06569
Multivariate Tests: regime
Df test stat approx F num Df den Df Pr(>F)
Pillai 2 0.0245554 0.3901908 2 31 0.68021
Wilks 2 0.9754446 0.3901908 2 31 0.68021
Hotelling-Lawley 2 0.0251736 0.3901908 2 31 0.68021
Roy 2 0.0251736 0.3901908 2 31 0.68021
------------------------------------------
Term: times
Response transformation matrix:
times1 times2
se1 1 0
se2 0 1
se3 -1 -1
Sum of squares and products for the hypothesis:
times1 times2
times1 56.94118 104.8235
times2 104.82353 192.9706
Multivariate Tests: times
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.4899083 14.40648 2 30 4.1184e-05 ***
Wilks 1 0.5100917 14.40648 2 30 4.1184e-05 ***
Hotelling-Lawley 1 0.9604317 14.40648 2 30 4.1184e-05 ***
Roy 1 0.9604317 14.40648 2 30 4.1184e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------
Term: regime:times
Response transformation matrix:
times1 times2
se1 1 0
se2 0 1
se3 -1 -1
Sum of squares and products for the hypothesis:
times1 times2
times1 23.32549 40.04314
times2 40.04314 68.76275
Multivariate Tests: regime:times
Df test stat approx F num Df den Df Pr(>F)
Pillai 2 0.2574779 2.290306 4 62 0.0696268 .
Wilks 2 0.7425563 2.407105 4 60 0.0592408 .
Hotelling-Lawley 2 0.3466530 2.513235 4 58 0.0512638 .
Roy 2 0.3465199 5.371058 2 31 0.0099358 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Univariate Type II Repeated-Measures ANOVA Assuming Sphericity
Sum Sq num Df Error SS den Df F value Pr(>F)
(Intercept) 22890.0 1 278.94 31 2543.8949 < 2.2e-16 ***
regime 7.0 2 278.94 31 0.3902 0.680205
times 96.7 2 134.58 62 22.2807 5.111e-08 ***
regime:times 34.7 4 134.58 62 3.9962 0.006003 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Mauchly Tests for Sphericity
Test statistic p-value
times 0.75295 0.014173
regime:times 0.75295 0.014173
Greenhouse-Geisser and Huynh-Feldt Corrections
for Departure from Sphericity
GG eps Pr(>F[GG])
times 0.80189 7.596e-07 ***
regime:times 0.80189 0.01105 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
HF eps Pr(>F[HF])
times 0.8389008 4.583464e-07
regime:times 0.8389008 9.855788e-03
This is the lengthy output that you need for the next question.
Extra: note that there is no “time” in the model self_esteem.1, but there is a times term in the output. In this kind of model, time sneaks in the back door, which is why we are unable to remove time if the interaction turns out not to be significant.
- (3 points) What do you conclude from your analysis, in the context of the data, using the output that you just obtained? Describe the process that you followed to get to your conclusions.
Check sphericity first, from the Mauchly tests near the bottom. We are first assessing the regime by time interaction, so we strictly need the P-value from the second test (though they are both the same). The P-value of 0.014 is smaller than 0.05, so sphericity fails and we cannot use the type II analysis for anything time-related.
Next, look at the Huynh-Feldt tests at the bottom. The P-value for the interaction is 0.0099; this is also less than 0.05, so the interaction is significant: the effect of treatment regime depends on time (that is, it is not consistent over time).
Next: STOP! Do not attempt to interpret the main effects now; we have seen that the effect of regime depends on time, so it makes no sense to claim that there is “an” effect of regime that works for all times, or an effect of time that is valid for all treatment regimes. Expect to lose points if you try to interpret main effects having found that the interaction is significant.
Extra: when sphericity fails, as here, the P-values for time-related things (the interaction and the time main effect) in the Type II tests are too small: in other words, when sphericity fails, if you look at the Type II tests, you will think that the results are more significant than they really are. The work that Huynh and Feldt did tells us how we can adjust the P-values for anything time-related when sphericity fails: that is to say, how much bigger to make the P-values. In this case, the P-value for the interaction goes up from 0.0060 (type II test) to 0.0099 (Huynh-Feldt). This is not a very big change, and not enough to stop the interaction from being significant. This is because the sphericity test, with a P-value of 0.014, was significant without being strongly significant. If the sphericity had failed more dramatically, the adjustment to the P-value would have been bigger, and we might have found that the interaction was no longer significant.
- (3 points) Make an interaction plot that illustrates how the mean self-esteem values vary over time and regime. Hint: do you need to rearrange the data?
The first thing is to say that you do need to rearrange the data, because you need it longer for a graph:1
self_esteem %>%
pivot_longer(starts_with("se"),
names_to = "time",
values_to = "self_esteem") -> self_esteem_long
self_esteem_long
If you want to (very much optional), you can use parse_number to pull out the numeric time values from what I called time above. It still works whether you do this or not.2
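If you did want the numeric times, a minimal sketch (assuming you are happy to overwrite time with just its number; parse_number comes with the tidyverse via readr):
self_esteem_long %>%
  mutate(time = parse_number(time))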
Now that you have this (and make sure you save it because you will need it again in a minute), you can either try to figure out the stat_summary thing in the lecture notes, or you can recognize that this is an ordinary interaction plot with one of the variables being time, so the first step is a group-by and summarize:
self_esteem_long %>%
group_by(regime, time) %>%
summarize(mse = mean(self_esteem))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by regime and time.
ℹ Output is grouped by regime.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(regime, time))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
to get the mean self-esteem score for each combination of regime and time (averaged over the subjects that were in that regime and at that time). Save this if you like, or pipe it into the plot:
self_esteem_long %>%
group_by(regime, time) %>%
summarize(mse = mean(self_esteem)) %>%
ggplot(aes(x = time, y = mse, colour = regime, group = regime)) +
geom_point() + geom_line()
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by regime and time.
ℹ Output is grouped by regime.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(regime, time))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
For an interaction plot, you need a colour and a group, and they are the same column.
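Extra: if you wanted to try the stat_summary route mentioned above instead, a sketch would be something like this, with ggplot working out the group means itself (so no group-by and summarize is needed):
ggplot(self_esteem_long, aes(x = time, y = self_esteem, colour = regime, group = regime)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")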
- (2 points) How is your interaction plot consistent with the conclusion from the repeated measures analysis?
The regimes do not compare consistently over time. For example, the diet plus exercise treatment has the highest mean self-esteem at time points 1 and 3, but the lowest at time 2. Or compare the patterns over time: the control regime only dips slightly at time 2, but both of the real treatments result in a bigger drop in self-esteem at time 2 and a much bigger increase at time 3. In some way, make the point that it is actually the combination of regime and time that predicts self-esteem.
Extra: you might also think of making a spaghetti plot, which goes like this (remembering that you need a colour and a group and now they are different):
ggplot(self_esteem_long, aes(x = time, y = self_esteem, colour = regime, group = subject)) +
geom_point() + geom_line()
I find this a good deal harder to read than the interaction plot. Partly this is because the self-esteem scores are all whole numbers, so you get multiple subjects with the same scores, and it’s harder to see where the traces are going. The clearest thing I see here is that the blue traces (diet plus exercise) are tending to go down and then sharply up again. The green ones seem to be all over the place.
Hang on, let me try something:
self_esteem_long %>%
mutate(self_esteem = jitter(self_esteem)) %>%
ggplot(aes(x = time, y = self_esteem, colour = regime, group = subject)) +
geom_point() + geom_line()
The technique called “jittering” moves an observation slightly away from its actual value (by adding a small random quantity) so that observations that had the same value before now have slightly different ones. Normally, to use this in a graph you use geom_jitter instead of geom_point, but the problem here is that we want to join the jittered points by lines, and geom_line only knows where the points were before, not where they got moved to after jittering. So I need to generate jittered points first, and then plot them joined by lines. There is a function called jitter that does exactly this, so I replace (temporarily) the self-esteem scores by the jittered versions of themselves, and then plot those joined by lines.
I like this spaghetti plot a lot better. On this one, you can see that the green traces at least sometimes go back up again after time 2, and it is clearer that the red ones sometimes go up and sometimes down, but on average there is no real trend in the red traces.
- (3 points) Fit a suitable mixed model, and demonstrate that your conclusions from it are the same as from the repeated measures analysis of variance.
This is actually a lot easier than the repeated measures ANOVA, both modelling and interpreting.
The model-fitting is this:
self_esteem.3 <- lmer(self_esteem ~ time * regime + (1 | subject),
data = self_esteem_long)
In words, this says that self-esteem score depends on time and regime and their combination, plus a random effect3 for subject. The correlation between self-esteem scores within a subject is taken care of by allowing each subject to have their own random effect that moves all their scores up (or down) across all times, like a block effect in ANOVA.
The summary of this kind of model is not very illuminating, but drop1 works just fine:
drop1(self_esteem.3, test = "Chisq")
and this says that once again, the interaction is significant: the self-esteem score depends on the regime and on the time point in combination, as we concluded before.
Rain forest leaf shapes
There are rain forests all over the world. Is it possible to tell where in the world a leaf from a rain forest tree came from, based on the leaf’s measurements?
The data in http://datafiles.ritsokiguess.site/leafshape.csv are measurements of (logarithm of) length, width, and petiole length of 206 leaves collected from rain forests in Panama and Costa Rica (combined into C America in the data), North and South Queensland in Australia (combined into Queensland in the data), and Tasmania in Australia. (The “petiole” is what you might know of as the “stalk” or “stem” of a leaf.)
- (1 point) Read in and display (some of) the data.
This, or something similar:
my_url <- "http://datafiles.ritsokiguess.site/leafshape.csv"
leafshape <- read_csv(my_url)
Rows: 206 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): location
dbl (3): logwid, logpet, loglen
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
leafshape
Extra: These data are based on the leafshape data from the DAAG package. There were originally six locations, also including Sabah in Malaysia, but the leaves from Sabah were very diverse and easily confused with leaves from other locations. Also, the leaves from the two Central America locations were very similar, and the leaves from North and South Queensland were very similar, so I combined those locations to make it clearer what was going on. There were also some other columns that I omitted:
leafshape0 <- DAAG::leafshape
leafshape0
The new dplyr (version 1.2.0) has a function recode_values that will do the combining we want (explained below). Before that, I get rid of the other columns:
leafshape0 %>%
select(starts_with("log"), location) %>%
rename(location_old = location) %>%
mutate(location = recode_values(
location_old,
"Panama" ~ "C America",
"Costa Rica" ~ "C America",
"N Queensland" ~ "Queensland",
"S Queensland" ~ "Queensland",
"Tasmania" ~ "Tasmania"
  ))
Then:
- rename the original location column so that I can use the name location for the new one I’m going to make
- create a new column called location that is built from the old one by converting the old name (on the left of the squiggle) to a new one (on the right)
- any names on the left that are not matched are replaced by an NA. At the top of this display, the new location for Sabah is missing, because I didn’t want to use the originally-Sabah data.
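Aside (my addition, not part of the original): if your dplyr does not have recode_values yet, case_match (available since dplyr 1.1.0) uses the same squiggle syntax and should do the same job, with unmatched values again becoming NA; a sketch:
leafshape0 %>%
  select(starts_with("log"), location) %>%
  rename(location_old = location) %>%
  mutate(location = case_match(
    location_old,
    "Panama" ~ "C America",
    "Costa Rica" ~ "C America",
    "N Queensland" ~ "Queensland",
    "S Queensland" ~ "Queensland",
    "Tasmania" ~ "Tasmania"
  ))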
Next, I get rid of the rows that have a missing location (ie., were from Sabah originally) and get rid of the old location column:
leafshape0 %>%
select(starts_with("log"), location) %>%
rename(location_old = location) %>%
mutate(location = recode_values(
location_old,
"Panama" ~ "C America",
"Costa Rica" ~ "C America",
"N Queensland" ~ "Queensland",
"S Queensland" ~ "Queensland",
"Tasmania" ~ "Tasmania"
)) %>%
drop_na(location) %>%
select(-location_old)
and these are the 206 rows of data I saved for you.
- (2 points) Run a suitable MANOVA and display the results.
There are three response variables (the quantitative ones beginning with log) and one categorical explanatory variable (location), on the grounds that the location has, or may have, an effect on the dimensions of the leaf.
So, first create the response matrix:
leafshape %>%
select(-location) %>%
as.matrix() -> response
head(response)
        logwid     logpet   loglen
[1,] 0.9082586 -1.5153831 1.908060
[2,] 1.1442228 -0.9425834 2.041220
[3,] 0.9669838 -1.2485518 2.075684
[4,] 1.3297240 -0.9672104 2.177022
[5,] 1.3737156 -0.8319863 2.244956
[6,] 1.4134230 -1.2288121 2.294553
If you prefer, use cbind, but then you have to type out the names of the columns. When there is something that will enable you to choose the columns without naming them (here, “everything except location”), I prefer to take advantage of that and then turn it into a matrix at the end.
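For the record, the cbind version would be a sketch like this (response3 is just an illustrative name):
# type out the three quantitative column names and bind them into a matrix
response3 <- with(leafshape, cbind(logwid, logpet, loglen))
head(response3)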
Then run manova (easiest, though if you can get the same result from Manova that’s also good):
leafshape.1 <- manova(response ~ location, data = leafshape)
summary(leafshape.1)
          Df  Pillai approx F num Df den Df    Pr(>F)
location 2 0.40199 16.938 6 404 < 2.2e-16 ***
Residuals 203
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
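Extra: if you went the Manova route from earlier instead, a sketch would be as below (the name leafshape.1a is mine); it should lead to the same Pillai test:
leafshape.1a <- lm(response ~ location, data = leafshape)
Manova(leafshape.1a)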
Conclusions coming up.
- (2 points) What do you conclude from your MANOVA, in the context of the data?
The P-value, less than \(2.2 \times 10^{-16}\), is (much) less than 0.05, so the location does have an effect on one or more of the response variables, or some combination of them.
Don’t try to be any more precise than this. Conclusions from MANOVA are always vague (even vaguer than from the \(F\)-test in an ANOVA), and we will be doing a discriminant analysis (coming up) in an attempt to find out what is different about the leaves from the different locations.
- (2 points) Run a suitable discriminant analysis and display the results.
Remember that a discriminant analysis is “logically backwards”: the categorical variable is treated as the response (even though it is actually explanatory) and the quantitative variables are treated as explanatory (even though they are really responses):
leafshape.2 <- lda(location ~ loglen + logpet + logwid, data = leafshape)
leafshape.2
Call:
lda(location ~ loglen + logpet + logwid, data = leafshape)
Prior probabilities of groups:
C America Queensland Tasmania
0.50970874 0.44660194 0.04368932
Group means:
loglen logpet logwid
C America 2.919594 0.2990920 1.9405839
Queensland 2.506848 -0.1043243 1.4939589
Tasmania 1.351106 -0.9574345 0.1837469
Coefficients of linear discriminants:
LD1 LD2
loglen -1.0037464 1.884861
logpet 0.8477512 1.272067
logwid -1.8403529 -2.656167
Proportion of trace:
LD1 LD2
0.9944 0.0056
Extra: you can get away without typing the names of all the quantitative variables by doing something like this:
leafshape.2a <- lda(location ~ . , data = leafshape)
leafshape.2a
Call:
lda(location ~ ., data = leafshape)
Prior probabilities of groups:
C America Queensland Tasmania
0.50970874 0.44660194 0.04368932
Group means:
logwid logpet loglen
C America 1.9405839 0.2990920 2.919594
Queensland 1.4939589 -0.1043243 2.506848
Tasmania 0.1837469 -0.9574345 1.351106
Coefficients of linear discriminants:
LD1 LD2
logwid -1.8403529 -2.656167
logpet 0.8477512 1.272067
loglen -1.0037464 1.884861
Proportion of trace:
LD1 LD2
0.9944 0.0056
This works because . means “all the other variables, not including location”. You can check that the numbers in the output are identical this way.
- (2 points) What would result in a leaf having a high score on LD1? Explain briefly.
Look at the Coefficients of Linear Discriminants. The three coefficients are of roughly similar sizes, so I would talk about them all. The ones for logwid and loglen are negative and the one for logpet is positive, so a leaf will have a large LD1 score if:
- the width is small
- the petiole length is large
- the length is small
that is to say, a small leaf in overall size with a long stalk. (Note: if the log of something is large or small, the thing itself is also large or small, so you don’t need to talk about logs here. On the other hand, you do need to translate loglen etc. for your reader. Unless they were the ones that recorded the data, they won’t know what loglen etc. mean.)
Alternatively, you might take the view that the coefficient of logwid is larger than the others in size, in which case you would say that a large LD1 score goes with a small width, only. This is what the “explain briefly” is there for: make a choice, but say why you made that choice.
Extra: looking at the group means, the leaves from Tasmania have the smallest width and length, so it looks as if those leaves may be the ones with largest LD1 scores. It’s not clear, because the leaves from Tasmania also have smallest petiole length, but we’ll see later whether that is correct.
- (2 points) How many linear discriminants do you expect to be important? Explain (very) briefly.
Look at the Proportion of Trace at the bottom of the output. Almost all of it (99.4%) is explained by LD1 and almost none by LD2 (0.6%), so we would not expect LD2 to be important. (What that means is LD2 is expected to do pretty much nothing in distinguishing the groups.)
- (2 points) Obtain a dataframe that contains the original data along with discriminant scores, predicted locations and posterior probabilities.
Something like this:
p <- predict(leafshape.2)
leafshape_p <- cbind(leafshape, p)
leafshape_p
Run predict (this is the same base R predict that we used in regression to get prediction intervals), and then glue the results onto the original data.
It is a good idea to display some of it so that you are convinced that you have actually answered the question. The check that you have done so is that you have:
- a column called class with predicted locations,
- columns with names beginning posterior that are the posterior probabilities of a leaf coming from that location given its measurements,
- columns with LD1 and LD2 in their names that are the scores on the linear discriminants.
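One way to glance at just those columns is a sketch like this (the column-name patterns are assumed from the description above):
leafshape_p %>%
  select(location, class, starts_with("posterior"), contains("LD")) %>%
  head()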
- (2 points) Make a table showing the actual and predicted locations and how many leaves were classified into each.
You have some choices here. I think the easiest one is this, using the base R table:
with(leafshape_p, table(location, class))
            class
location C America Queensland Tasmania
C America 79 26 0
Queensland 29 61 2
Tasmania 0 4 5
You can also use count, but this leaves you with some work to do to make it look nice:
leafshape_p %>%
count(location, class)
This has the same information as the above table, but is not so easy to read. What we really want is a table with (best) class as columns, which we can do like this:
leafshape_p %>%
count(location, class) %>%
pivot_wider(names_from = class, values_from = n)
The NA values here are really zero (note that the first run of count has only seven rows rather than nine, so two of the rows are missing). You can fix this as follows:
leafshape_p %>%
count(location, class) %>%
pivot_wider(names_from = class, values_from = n, values_fill = 0)
If you do this, say where you found out about values_fill.
- (2 points) How distinguishable do the locations seem to be on the basis of the leaf measurements? Explain briefly.
Pick an adjective and say why you think that’s good. Mine is “somewhat”: most of the leaves have been classified correctly, but there are still a lot that are wrong. For example, the Central America and Queensland leaves are sometimes confused with each other, and only just over half of the leaves actually from Tasmania (bottom row) were classified as Tasmania.
I think your answer needs to convey how the leaves are in some ways distinguishable by location and in some ways not.
- (2 points) Plot the discriminant scores against each other, with the points labelled by where each leaf actually came from.
Plot what (on my output) is called x.LD1 against x.LD2, using location as colour:
ggplot(leafshape_p, aes(x = x.LD1, y = x.LD2, colour = location)) +
geom_point()
Extra: I guess I was right about Tasmania before: the blue points are mostly on the right.
- (2 points) Does your plot (Question 17) tell a similar story to your misclassification table (Question 16)? Explain briefly.
Say something about how your adjective chosen in Question 16 makes sense, looking at the plot in Question 17.
My word “somewhat” is supported by the red points (Central America) being mostly on the left, the green ones (Queensland) being mostly in the middle, and the blue ones (Tasmania) being mostly on the right. But there is a lot of overlap, especially between Central America and Queensland (red and green), and some of the Queensland points (green) are far enough right that they look as if they came from Tasmania.
I am looking for something that conveys “partly distinct” and something that conveys “partly not distinct”.
- (2 points) How does your plot of Question 17 support your conclusion of Question 13? Explain briefly.
In Question 13 we said that only LD1 was important, meaning that only LD1 distinguishes the locations in any meaningful way. On the plot of Question 17, the locations are only distinguished left and right (in the LD1 direction) and not at all up and down (LD2 direction). So it is indeed true that only LD1 is important in distinguishing the locations.
- (3 points) Find the leaf in row 105 (of your dataframe with LD scores and posterior probabilities). Is this leaf classified correctly? Find this leaf on your graph of Question 17, and explain why its posterior probabilities make sense.
Find it first:
leafshape_p %>% slice(105)
This is actually from Central America and was correctly classified as being from there (columns location and class).
To find this leaf on the plot, find its LD scores, by scrolling across or thus (for example):
leafshape_p %>% slice(105) %>%
select(contains("LD"))This is one of the three leftmost points on the graph, and the only one with positive LD2 score, so it is the point towards the top left, in with the ones from Central America.
The posterior probabilities are:
leafshape_p %>% slice(105) %>%
select(starts_with("posterior"))96% Central America, 4% Queensland, 0% Tasmania. Since the leaf is in the Central America part of the graph (on the left), it is no surprise that it is overwhelmingly likely to have been from Central America (and actually was).
Extra: this assignment is now Officially Too Long, so I won’t ask any more, but anyway: having a low LD1 score, as this leaf does, not only means that it is almost certainly from Central America, but also that it should show the opposite of what we found in Question 12: large leaf length and width, small petiole:
leafshape_p %>%
slice(105) %>%
select(starts_with("log"))summary(leafshape) logwid logpet loglen location
Min. :-1.931 Min. :-2.67123 Min. :-0.1561 Length:206
1st Qu.: 1.314 1st Qu.:-0.62109 1st Qu.: 2.4079 Class :character
Median : 1.605 Median :-0.08378 Median : 2.6875 Mode :character
Mean : 1.664 Mean : 0.06403 Mean : 2.6667
3rd Qu.: 1.979 3rd Qu.: 0.50471 3rd Qu.: 3.0044
Max. : 3.832 Max. : 3.93339 Max. : 4.2808
Width higher than 3rd quartile, length the largest of all, but petiole also above 3rd quartile. So sort of right.
What can happen is that the original variables can be positively correlated, so that one of them being large and the other one small is a lot to ask. Here’s length vs. petiole:
ggplot(leafshape_p, aes(x = loglen, y = logpet, colour = location)) +
geom_point()
The length and petiole are indeed positively correlated, both overall and within each location. The leaf we were looking at is the one way over on the right; because it has a large length, we would also expect it to have a largish petiole, as it does. You might say that when it comes to LD1, the length and width “win” over the petiole length. That is to say, the principal distinction between the locations actually seems to be based on the leaf length and width: large is Central America, medium is Queensland, small is Tasmania:
ggplot(leafshape_p, aes(x = loglen, y = logwid, colour = location)) +
geom_point()
At least most of the time.
Points: \((1+2+5+3+3+2+3) + (1+ 11(2) + 3) = 19 + 26 = 45\).
Footnotes
The pivot_longer code is long, so I split it over three lines to make it easier to read. If your web browser window is too narrow, it will otherwise spill off the right side and you won’t be able to see it without scrolling.↩︎
The times are equally spaced, so the graph looks exactly the same either way, apart from the exact values on the \(x\)-axis.↩︎
Actually, a “random intercept”.↩︎