STAD29 Assignment 1

Packages

library(tidyverse)
library(broom)
library(marginaleffects)

Rolling a lawn

Lawn rolling is a maintenance technique developed to flatten the top one to two inches of soil surfaces (of an area of grass), rendering them smooth and level. Steel or polyurethane drums or cylinders, filled with water or sand, are pulled by hand or towed to flatten irregularities caused by frost heaving or to smooth out high-traffic areas. Source.

In determining the weight of roller to use, we are interested in the relationship between the weight of the roller (in tonnes) and the amount it compresses the lawn (“depression”, in millimetres), under dry conditions. The data are in http://ritsokiguess.site/datafiles/roller.csv.

  1. (1 point) Read in and display the data (there are only ten observations).

As usual:

my_url <- "http://ritsokiguess.site/datafiles/roller.csv"
roller <- read_csv(my_url)
Rows: 10 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): weight, depression

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
roller

I called the dataframe roller.

Source: the roller dataset from the DAAG package, used as is.

Extra: if you are a fan of the game of cricket, you will know about the use of a roller in preparing the pitch. The idea is that a cricket pitch should be smooth, so that the bounce of the ball off the surface is consistent (and safe for the batter), but also the surface should be compacted and hard (so that there is pace and bounce for the bowlers). Cricket groundstaff know how to use rollers of the appropriate weight to achieve this. Source. During a match, the captain of the team that is about to bat may also request that the pitch be rolled (if they think it will make the pitch better for batting on).

  2. (3 points) Make a suitable plot of the data. Is the trend apparently linear?

Two quantitative variables, so a scatterplot. The weight of the roller is explanatory (input), and the depression caused by the roller is the response (output), so:

ggplot(roller, aes(x = weight,  y = depression)) + geom_point()

This suffices. Optionally, you can also add a smooth trend (not a regression line, yet):

ggplot(roller, aes(x = weight,  y = depression)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The trend of the points is more or less linear (or, at least, not obviously curved, as far as you can tell with only ten points). Likewise, if you added a smooth trend, it wiggles a bit but there is no obvious curved shape. Your answer should get as far as “not obviously non-linear” if you like double negatives, or “approximately straight” if you don’t.
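If you want a number to go with "approximately straight" (not something the question asks for), the correlation is a quick check; a sketch:

```r
# correlation between weight and depression; closer to 1 means a stronger
# upward linear trend
with(roller, cor(weight, depression))
```

This comes out at about 0.80 (its square is the R-squared you will see in the regression output later), consistent with a fairly strong but not perfect linear relationship.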

Extra: the major purpose of making a scatterplot at this point is to see whether we have a relationship, and if so, what kind of relationship it is.1 Putting a regression line on this plot:

ggplot(roller, aes(x = weight,  y = depression)) + geom_point() + 
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

is putting the cart before the horse (what philosophers call “begging the question”) because it is too soon to ask how well a linear relationship fits (we don’t know whether we should fit one yet). I also find that putting a regression line on a scatterplot biases me, in that it is somehow suggesting that a straight line will fit well, and it’s harder work than it should be to decide that a straight line is actually not the right thing.

  3. (2 points) Fit a suitable regression, and display the output.

You already decided that depression should be the response, so:

roller.1 <- lm(depression ~ weight, data = roller)
summary(roller.1)

Call:
lm(formula = depression ~ weight, data = roller)

Residuals:
   Min     1Q Median     3Q    Max 
-8.180 -5.580 -1.346  5.920  8.020 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -2.0871     4.7543  -0.439  0.67227   
weight        2.6667     0.7002   3.808  0.00518 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.735 on 8 degrees of freedom
Multiple R-squared:  0.6445,    Adjusted R-squared:  0.6001 
F-statistic:  14.5 on 1 and 8 DF,  p-value: 0.005175

I didn’t ask for comment, but you might like to note for yourself that there is a moderately high R-squared (suggesting a real relationship with weight, but one with a certain amount of scatter). This is supported by the small P-value (0.005) for weight, and the slope of 2.67 is positive, indicating an upward trend.
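Since we loaded broom at the top, an alternative, tidier way to look at the same fit (a sketch; any reasonable display of the output is fine) is:

```r
# coefficient table as a dataframe
tidy(roller.1)
# one-row model summary including R-squared and the overall P-value
glance(roller.1)
```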

Extra: this is not a C32-style regression question, so I’m not asking you to make residual plots, but here they are:

ggplot(roller.1, aes(x = .fitted, y = .resid)) + geom_point()
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggplot2 package.
  Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.

The points are at least all over the plot. The negative residual with fitted value close to 20 argues against a curve. This is perhaps not the most convincing “random” plot you will ever see, but at least it is true that knowing the fitted value doesn’t tell you anything much about the residual.
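The warning above says that the `fortify` route is deprecated; the future-proof version goes through `broom::augment`, which attaches columns such as .fitted and .resid to the data. The same plot, sketched that way:

```r
# same residuals-vs-fitted plot, via augment rather than fortify
augment(roller.1) %>% 
  ggplot(aes(x = .fitted, y = .resid)) + geom_point()
```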

ggplot(roller.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()

This is not bad for only ten observations.
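If you wanted a formal test to go with the normal quantile plot (again, not something this question asks for), a Shapiro-Wilk test on the residuals is one option, though with only ten observations it has little power:

```r
# Shapiro-Wilk test of normality of the residuals
shapiro.test(resid(roller.1))
```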

  4. (4 points) For values of weight from 2.5 to 12.5 in steps of 2.5, obtain predicted values of depression, along with confidence intervals for the mean depression at those weights.

These are confidence intervals for the mean response, which is exactly what predictions from marginaleffects produces. To set this up, create a dataframe (that I usually call new) with the values of weight to predict for:

new <- datagrid(model = roller.1, weight = seq(2.5, 12.5, 2.5))
new

and then feed that into predictions (wrapped in cbind to display the result as a dataframe), and select the columns you want, which are these, in some order:2

cbind(predictions(roller.1, newdata = new)) %>% 
  select(weight, estimate, starts_with("conf"))

For yourself, you might note that these intervals are on the wide side, a consequence of only having ten observations.

Extra 1: you can create new however you wish. In this case, tibble would also work. I actually wrote these solutions using tibble, until I realized that I had used datagrid in the lecture notes, so I figured I ought to be consistent. The key thing is to make sure you have a dataframe with a column called weight and those five values in it.
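For completeness, the tibble version I originally wrote would have looked something like this:

```r
# same five weights, typed as a sequence rather than built via datagrid
new <- tibble(weight = seq(2.5, 12.5, 2.5))
new
```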

Extra 2: prediction intervals for individual observations can be obtained as below. Note that predict only produces the predictions and intervals themselves, not the values they are predictions for, so the usual process is to save the predictions and then put them side by side with the values they belong to, which are in new. Take a look at what I called p to confirm that:

p <- predict(roller.1, new, interval = "p")
cbind(new, p)

These intervals are even wider: so wide, you might say, as to be almost useless.3
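If you want to make "even wider" concrete, compute the widths (lwr and upr are the column names that predict uses for the interval endpoints):

```r
# width of each 95% prediction interval
cbind(new, p) %>% mutate(width = upr - lwr)
```

Comparing these widths with conf.high minus conf.low from the predictions output above makes the contrast clear.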

  5. (2 points) Obtain a plot of the predicted depression values against weight.

This is what plot_predictions does:

plot_predictions(roller.1, condition = "weight")

Comment: the predictions form a line (of course), and the envelope around the line is the confidence interval for a prediction at each weight. (We are coming to this.) The confidence intervals are narrower in the middle, because there are more “nearby” points to base a prediction on.
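You can back up the "narrower in the middle" observation numerically with the interval widths at the weights from question 4; a sketch, reusing the dataframe new:

```r
# confidence interval width at each weight; expect the smallest width
# near the mean of the observed weights
cbind(predictions(roller.1, newdata = new)) %>% 
  mutate(width = conf.high - conf.low) %>% 
  select(weight, width)
```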

  6. (2 points) For one of your predictions in Question 4, verify that its confidence interval matches what you see on your graph of the previous question, describing your process.

It is perhaps easier to pick one of the wider intervals, say the one for weight 10. In question 4, the confidence interval went from 17.75 to 31.4. On the graph, the interval should correspond to the bottom and top of the envelope at that value of weight. For a weight of 10, the bottom of the envelope is just over halfway from 15 to 20 (that is, just over 17.5), and the top of the envelope is less than halfway from 30 to 35 (so, less than 32.5). These are consistent with the values from question 4.

For you, read the values off the graph (it is enough to be only as accurate as I was) and demonstrate that they are consistent with what you had before.

Orobanche cernua

Orobanche cernua, commonly known as “nodding broomrape”, is a herb that grows across Europe and northern Asia. A biologist suspects that seeds of this herb germinate better in a bean root extract that has been diluted in water, and conducts a study to understand the effect of different dilutions on the germination rate. The data are in http://ritsokiguess.site/datafiles/aod_orob1.csv. The three different dilutions are expressed as a fraction: for example, 1/25 means 1 part of bean root extract and 25 parts of water. Also in the data are the number of seeds n exposed to the bean root extract at that dilution, and the number y of those seeds that germinated successfully. There are several replicates at each dilution.

  7. (1 point) Read in and display (some of) the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/aod_orob1.csv"
oro <- read_csv(my_url)
Rows: 16 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): dilution
dbl (2): n, y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oro

You might have a bit of trouble finding a good and short enough name for the dataframe. Anything that is at least reasonably descriptive is good. Mine is a very minimal name.

Source: the orob1 dataset from the aod package.

  8. (2 points) How do you know that there is more than one observation per row of the dataframe?

The observations in this case are seeds (each one either germinates or not, which is the actual response), and there is more than one seed per row, specifically the number in the column n. (If there had only been one seed per row, there would have been a column with a name like germinated and values “yes” and “no”, or “true” and “false”).
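If you are curious what the one-observation-per-row layout would look like, you can expand the grouped data yourself. This sketch invents a germinated column (the real data has no such column), using tidyr’s uncount to make one row per seed:

```r
# one row per seed: y rows per group that germinated, n - y that did not
oro %>% 
  mutate(no = n - y) %>% 
  pivot_longer(c(y, no), names_to = "germinated", values_to = "count") %>% 
  uncount(count)
```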

  9. (3 points) Fit a suitable logistic regression, and display the output.

Bear in mind that we have “grouped” observations (more than one per row), so we need a two-column response, the number of successfully germinated seeds in the first column (which is y), and the number of seeds that failed to germinate in the second (which you will have to include a calculation for):

oro.1 <- glm(cbind(y, n - y) ~ dilution, family = "binomial", data = oro)
summary(oro.1)

Call:
glm(formula = cbind(y, n - y) ~ dilution, family = "binomial", 
    data = oro)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.8015     0.1851  -9.732   <2e-16 ***
dilution1/25    3.7225     0.2717  13.703   <2e-16 ***
dilution1/625   3.4771     0.2559  13.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 387.689  on 15  degrees of freedom
Residual deviance:  22.945  on 13  degrees of freedom
AIC: 78.273

Number of Fisher Scoring iterations: 4

  10. (2 points) Is there a significant effect of dilution as a whole? Explain briefly, obtaining some more output if necessary.

Having a categorical explanatory variable means that the two (very significant) tests shown in the summary output are of the level shown against the baseline 1/1 dilution, and not what we want. Hence, we need a drop1, as in regular regression (R knows what to do for this kind of model). This is a generalized linear model, so the test needs to be Chisq rather than F.

drop1(oro.1, test = "Chisq")

With a tiny P-value of \(6.3 \times 10^{-80}\) (!), there is definitely an effect of dilution overall.

This is the best approach, but there is partial credit for comparing the null and residual deviances in the summary: the null model with nothing in it (i.e., without dilution) fits much worse than the model with dilution, because the null model has a much larger deviance than the model we actually fitted. Hence, we do much better to include dilution in the model. This is not full credit, though, because this approach does not give you a P-value for dilution as a whole, even though it does suggest that whatever the P-value is, it will be pretty small.
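In fact, the deviance comparison can be turned into exactly that test by hand: the drop in deviance from the null model to ours is chi-squared, with df equal to the difference in residual dfs (15 - 13 = 2). Using the deviances from the summary:

```r
# drop in deviance, tested against chi-squared with 2 df
pchisq(387.689 - 22.945, df = 2, lower.tail = FALSE)
```

This reproduces the drop1 P-value of about \(6.3 \times 10^{-80}\).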

Another way you might think of going is anova (in the sense of comparing two models). Compare the model with dilution (the one you fitted) against the model without. The latter model has nothing in it, and you might have to investigate (cite your source) to find out how to fit a model with just an intercept. The answer is to have a 1 on the right side of the squiggle:

oro.0 <- glm(cbind(y, n - y) ~ 1, family = "binomial", data = oro)

We don’t need to look at this, since all we are going to do is to compare its fit with the model I called oro.1:

anova(oro.0, oro.1, test = "Chisq")

Remember that the test is Chisq again, for the same reason as above. Looking at the output here vs. the drop1, you see that the test and P-value are exactly the same, so this, properly reasoned, is also a full-credit answer.

  11. (2 points) Obtain predicted probabilities of germination, with confidence intervals, for each dilution.

This is most easily done using marginaleffects. First, make a dataframe (new) containing the three dilutions. Obtain this how you like, but it is safer to get it via a calculation rather than typing in the values one by one.

This is the easiest way: count the dilutions (and then ignore the actual counts):

oro %>% count(dilution) -> new
new

You can also use datagrid (having loaded marginaleffects), but for this you need just the three different values of dilution, which you can get either from the above or like this:

new1 <- datagrid(model = oro.1, dilution = unique(oro$dilution))
new1

In this case, you get some additional columns, with a column identifying the rows, and the (meaningless) mean values of y and n, but as long as you have the three values of dilution you want, it doesn’t matter what other columns you have.

Now, one way or another, you have a dataframe with a column called dilution with the three distinct values in it, and you can use that as input to predictions:

cbind(predictions(oro.1, newdata = new))

This gives you too many columns (as usual), so select the ones you want, which this time include the confidence limits (you will see why later):

cbind(predictions(oro.1, newdata = new)) %>% 
  select(dilution, estimate, conf.low, conf.high)

For me, it makes more sense to have the values being predicted for be first, but I have no objection if you have the columns in some other order, such as the order they appear in the predictions output.

  12. (2 points) What do your predictions suggest about the reason for the significance or non-significance of your result in question 10?

In question 10, we found there was a (very) significant effect of dilution. The predictions here suggest that this is because the 1/1 dilution is very different from the other two (less germination), which are very similar in terms of germination to each other.

There are two reasons why the 1/25 and 1/625 dilutions are similar in terms of germination: first, their Estimates are very close, but second and more insightfully, their confidence intervals overlap, so from that point of view, they are not significantly different in terms of germination.

Extra:

I was expecting the Estimates to be in order (increasing, in this case), in that the dilutions are in decreasing order in the table. But remember, the dilution is being treated as categorical, so that a separate probability of germination is estimated for each dilution, and they don’t have to be in any particular order. It does look, however, as if smaller dilution values go with a larger probability of germination.

If we had fitted dilution as quantitative, then things would have come out in order. It takes a bit of effort to get the dilution values out:

oro %>% mutate(denom = str_extract(dilution, "[0-9]+$")) %>% 
  mutate(dilution_numeric = 1 / as.numeric(denom)) -> oro
oro

I used a regular expression to pull out the denominator of the dilution (the numerator is always 1). This one says “match one or more numbers and then the end of the text”. This is still text, though, so when I made my numeric dilutions, I had to turn it into a number before working out 1 over it.
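To see the regular expression in action on a single value (a quick check, not part of the solution):

```r
# "[0-9]+$": one or more digits, anchored at the end of the string
str_extract("1/625", "[0-9]+$")
# [1] "625"
```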

These numeric dilutions differ by a constant multiple: 1, then that divided by 25, then that divided by 25 again, so I reckon it makes sense to use the log of the numeric dilution in a model:

oro.2 <- glm(cbind(y, n - y) ~ log(dilution_numeric), family = "binomial", data = oro)
summary(oro.2)

Call:
glm(formula = cbind(y, n - y) ~ log(dilution_numeric), family = "binomial", 
    data = oro)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -1.18428    0.14171  -8.357   <2e-16 ***
log(dilution_numeric) -0.59697    0.04379 -13.633   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 387.69  on 15  degrees of freedom
Residual deviance: 107.21  on 14  degrees of freedom
AIC: 160.54

Number of Fisher Scoring iterations: 5

plot_predictions(oro.2, condition = "dilution_numeric")

As the dilution number increases, the probability of germination decreases, so that a more dilute bean extract (one with a bigger number) is better. The graph is in terms of the actual numeric dilutions, which are these:

oro %>% mutate(ldn = log(dilution_numeric))

two of which are way over on the left side of the graph; the slope in the output says that as log of the numeric dilution increases by one, the log-odds of germination decreases by about 0.60. For these three dilutions, the log dilution numbers differ by a bit over 3 from one to the next, so the log-odds of germination change by this much:

3.22 * 0.60
[1] 1.932

which is a pretty big change in log-odds terms, and therefore in terms of probability:

new <- datagrid(model = oro.2, dilution_numeric = unique(oro$dilution_numeric))
cbind(predictions(oro.2, newdata = new)) %>% 
  select(dilution_numeric, estimate, conf.low, conf.high)

In this case, the confidence intervals for the 1/25 and 1/625 dilutions no longer overlap. This may have to do with the additional structure we imposed in this model, including using the log of the dilution number which means that 1/1, 1/25, and 1/625 are equally spaced in log-dilution terms. This may or may not have been reasonable; let’s try to add the data to the graph above:

oro %>% mutate(prop_germ = y/n) -> oro
oro
plot_predictions(oro.2, condition = "dilution_numeric") +
  geom_point(data = oro, aes(x = dilution_numeric, y = prop_germ))

This is, in fact, not very convincing at all. The two lowest dilution_numeric values (close together on the left) look as if they should have about the same germination probability (as our first model indicated), and the 1/1 dilution should have a smaller probability of germination than our second model indicates.

Points: \((1+3+2+4+2+2) + (1+2+3+2+2+2) = 14 + 12 = 26\)

Footnotes

  1. This is the same kind of thinking as looking at a histogram first with a quantitative variable, to see what shape of distribution you have, and then looking at a normal quantile plot if you then decide that normality is of interest.↩︎

  2. You can name the columns of confidence limits individually.↩︎

  3. I have to say, this is not uncommon with prediction intervals.↩︎