library(tidyverse)
library(marginaleffects)
library(survival)STAD29 Assignment 4
Packages
note to self: compare survival curves on worksheet and assignment
Acute myelogenous leukemia
Acute myelogenous leukemia, also called AML, is a cancer of the blood and bone marrow. It is treated by chemotherapy. If the treatment is successful, the leukemia is said to be “in remission”, meaning that it has gone away for now, but cancers can return, and the goal of a treatment is to have the length of the remission be as large as possible.
In a 1974 study at Stanford, 23 patients whose AML was in remission were randomized into two groups. The treatment group received “maintenance chemotherapy”, while the control group did not. It is believed that maintenance chemotherapy increases the length of the remission (that is to say, it causes the leukemia to take longer to return).
The data in http://datafiles.ritsokiguess.site/aml.csv contain the following columns:
time: length of remission in weeks (time until recurrence of the leukemia).cens: 1 indicates that the leukemia returned so thattimefor such a patient is the time until the leukemia returned; 0 indicates that recurrence was never observed, either because the patient left the study, or they were still in remission at the end of the study.treatment:maintif the patient received maintenance chemotherapy,noneif they did not.
- (1 point) Read in and display (some of) the data.
I don’t think you needed me to tell you that this was first:
my_url <- "http://datafiles.ritsokiguess.site/aml.csv"
aml <- read_csv(my_url)Rows: 23 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): treatment
dbl (2): time, cens
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
amlScroll down to see the patients in group none.
Extra: this is the aml dataset from the boot package, which is the same data (expressed differently) as the aml dataset from the survival package. I made two changes for you:
- I relabelled the treatment groups to make it clearer that one group had maintenance chemotherapy and the other did not. (The original values were 1 and 2, which were (i) numeric when they really ought to have been text, and (ii) not making it clear which group was which.)
- The treatment column was originally named
group. Depending on your precise setup withmarginaleffects, this can cause problems later (remember, when you were doing logistic regression with several response categories, thatpredictionsused the columngroupto tell you which response category it was predicting for? That.)
- (2 points) Create and display (but you don’t need to save) a dataframe that includes a suitable response variable for a Cox proportional hazards model.
This uses Surv from the survival package. There are two inputs to Surv: the time variable, and something that is true if the event of interest happened. The event of interest is that the leukemia returned, which corresponds to cens being 1. I called mine y (you can call it whatever you like):
aml %>% mutate(y = Surv(time, cens == 1))It is perhaps a little confusing that cens being 1 corresponds to data that is not censored, but so it is.
- (2 points) How does your new variable distinguish patients for whom the leukemia returned from patients for whom it did not?
Where the leukemia returned (cens == 1), the new column just contains a number. When it was never observed to return, the number has a plus sign.
Make sure you are clear about what distinguishes the values with a + from values without (in the context of the data). If you like, use the word “censored” (the ones with a + are censored observations), but be sure to explain what that means here. (“Censored” means “the event of interest didn’t happen”, but what that event actually is will depend on the data you’re looking at.)
The mnemonic for the + is that the value is “at least” the number shown. For example, the observation 28+ means that the time until the leukemia returned would have been at least 28 weeks, had it been observed.
Extra: the notation Surv[,2] below the column heading means that even though it looks like only one column, my y is actually two columns: one with the time, and one with the censoring indicator. It is thus a kind of list-column, specifically a two-column matrix stored in one column.
I was hoping to be able to show you y’s innards, but I seem to be unable to do so.
- (2 points) Fit a Cox proportional hazards model to assess the effect of the treatment on remission times, and display the output.
This kind of thing, using coxph:
aml.1 <- coxph(Surv(time, cens == 1) ~ treatment, data = aml)
summary(aml.1)Call:
coxph(formula = Surv(time, cens == 1) ~ treatment, data = aml)
n= 23, number of events= 18
coef exp(coef) se(coef) z Pr(>|z|)
treatmentnone 0.9155 2.4981 0.5119 1.788 0.0737 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
treatmentnone 2.498 0.4003 0.9159 6.813
Concordance= 0.619 (se = 0.063 )
Likelihood ratio test= 3.38 on 1 df, p=0.07
Wald test = 3.2 on 1 df, p=0.07
Score (logrank) test = 3.42 on 1 df, p=0.06
It seems to work better (when using plot_predictions below) to define the response variable (again) directly in the modelling statement. I only asked you to create the column I called y in your dataframe because I wanted you to look at it; otherwise, you could proceed directly to fitting the model.
In terms of accounting for things: you have three columns in the dataframe, one of which is treatment group, and the other two are used in making the response: the length of remission, and whether or not the leukemia returned. So we have used everything.
- (2 points) Does your model indicate that the maintenance chemotherapy has a positive effect? Explain briefly. (Hint: think about what the Cox model output is telling you.)
The output gives us a P-value for group (specifically, a P-value for the comparison of the none group against the baseline maint group). This is testing whether the coefficient of groupnone is zero, against the alternative that it is not zero, a two-sided test. (This is the same principle as in regression output, where you get a test that the slope is zero vs. that it is not zero.) But we wanted to know whether the maintenance chemotherapy has a positive effect, a one-sided test.
The principle for getting a one-sided test from a two-sided one works anytime, including here:
- ask whether we are on the correct side (the data points towards the alternative being true)
- if we are on the correct side, halve the P-value. (If we are on the wrong side, the result is “not significant”.)
The coefficient for groupnone is positive. This means that the hazard of recurrence of the leukemia is higher for the none group than it is for the baseline maint group. Another way to say this is that on average the time to recurrence is longer for the maint group, which is what we wanted evidence for: we are indeed on the correct side.
Having seen that (and you will need to be careful in explaining how you know), we can take the P-value and halve it:
0.0737 / 2[1] 0.03685
This is less than 0.05, so we can conclude that the maintenance chemotherapy does have a positive effect, because it significantly lengthens time to recurrence.
To get this right, you’ll have to join all the dots: the coefficient is positive, say what that means in terms of relative times to recurrence for the two groups, and then say that we are on the correct side. The coefficient being positive doesn’t mean anything until you work out what it actually means.
Extra: you might have thought, since group is categorical, that we should have used drop1:
drop1(aml.1, test = "Chisq")but the problem here is that this is testing for any effect of the treatment, positive or negative and, had it been significant at 0.05, we would not have known which way the effect goes. It might have been that the treatment was harmful, and you don’t want to be claiming that the treatment is beneficial when it is actually harmful, do you?
You’ll also note that the (two-sided) P-value from drop1 is slightly different from the one from summary. This is the Wald test vs. likelihood ratio test thing that I talked about in connection with the hypernephroma data on the worksheet.
- (3 points) Make a plot of the predicted survival probabilities against time for each treatment group. Does the plot support your conclusion about the effectiveness of the maintenance chemotherapy?
This is exactly what plot_predictions does. There are some things to get right:
- the first thing in
conditionis the time variable (on the \(x\)-axis), here calledtime - the second thing in
conditionis the explanatory variable in your Cox model, heregroup. This is categorical this time, so it makes sense to display the groups with colours. You might have a quantitative explanatory variable (often “age” in this kind of work), and then the plot would show survival curves for a selection of ages, as on the worksheet problem. - you have to specify
typeof predictions for this kind of model. We want predicted probabilities of “survival”, that is, of the leukemia not having returned, by each time. The clue that you have it right is that the \(y\)-axis goes from 0 to 1. If you forget this, you’ll get survival curves that go uphill. You know this cannot be right, because the longer you wait, the probability that the event (here recurrence of leukemia) hasn’t happened yet must go down.
That leads to this:
plot_predictions(aml.1, condition = c("time", "treatment"), type = "survival")Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.
Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_ribbon()`).
The last thing is to say whether this graph indicates that the maintenance chemotherapy has a positive effect. At any time, the predicted probability of not having seen the leukemia return is higher for the maintenance chemotherapy group, compared to the none group, which is a good thing and an indication that the maintenance chemotherapy is indeed useful. (The conclusion you draw from this graph should be the same as you drew from the test; if you did not find a significant difference, you will have to try to sell the grader on these survival curves being “not very different”.)
Depending on your precise setup, you might get a message like this one:
Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.
I get this on my desktop but not my laptop, and on my desktop I get it twice.
The key thing about error and warning messages is to read them. This one says that the way it works out standard errors for estimates (and thus the coloured envelopes on the plot) may be too optimistic (there is actually more uncertainty than the plot shows). This doesn’t matter so much, because we have already concluded that the treatment has a significantly positive effect, so we know that those survival curves really are different and the uncertainty about them doesn’t matter so much. The last part of the message tells you two ways to get rid of the warning. The first way is the easiest, since vcov is one of the inputs to plot_predictions:
plot_predictions(aml.1, condition = c("time", "treatment"), type = "survival", vcov = FALSE)Ignoring unknown labels:
• fill : "treatment"
This gets rid of the coloured envelopes entirely (vcov stands for “variance-covariance”, so you might guess it has something to do with uncertainty). There is a different message now (for me, on my desktop), which looks like a bug because there is no fill anywhere any more.
Points: \(1 + 2 + 2 + 2 +2+ 3 = 12\).