Worksheet 5

Published

January 30, 2026

Packages

library(tidyverse)
library(marginaleffects)
library(survival)

Hypernephroma

Hypernephroma, also known as renal cell carcinoma or kidney cancer, is a type of cancer that starts in the kidneys. It’s one of the most common types of kidney cancer in adults. Source.

The 33 patients in http://ritsokiguess.site/datafiles/lee_hypernephroma.csv, from a study carried out in the 1970s, all had hypernephromia and were all treated with chemotherapy, immunotherapy, and hormonal therapy. (That is to say, in this study all the patients got the same treatment, so treatment is not one of the explanatory variables.) There are a lot of columns in our data, among them:

age of patient in years
gender of patient, noted as F or M.
date of treatment_start, as text (month - day - year)
date of treatment_end (last followup or date of death), as text
status of patient when last seen
the last five columns are the results of skin tests taken at the start of treatment.

The researchers wanted to see whether any of the skin test results, as well as the age and gender of the patient, helped in predicting survival time after the start of treatment.

Read in and display (some of) the data.

my_url <- "http://ritsokiguess.site/datafiles/lee_hypernephroma.csv"
hypernephroma <- read_csv(my_url)

Rows: 33 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): gender, treatment_start, treatment_end, status
dbl (8): patient, age, response, monilia, mumps, PPD, PHA, SK_SD

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hypernephroma

I gave my dataframe a long name, but I don’t have to type it out every time; I type something like hyp, wait, and watch for the autocomplete to tell me the rest of it. Hitting Enter will select it, or I can move up and down if there are several choices.

There are 33 patients and 12 variables, including the ones mentioned above.

If you can’t see all the column names (that you will need later), you have the glimpse trick¹ at your disposal:

glimpse(hypernephroma)

Rows: 33
Columns: 12
$ patient         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ age             <dbl> 53, 61, 53, 48, 55, 62, 57, 53, 45, 58, 61, 61, 77, 55…
$ gender          <chr> "F", "M", "F", "M", "M", "F", "M", "M", "M", "M", "F",…
$ treatment_start <chr> "3/31/77", "6/18/76", "2/1/77", "12/19/74", "11/10/75"…
$ response        <dbl> 1, 0, 3, 2, 0, 2, 0, 2, 0, 3, 3, 1, 0, 2, 3, 0, 0, 3, …
$ treatment_end   <chr> "10/1/77", "8/21/76", "10/1/77", "1/15/76", "1/15/76",…
$ status          <chr> "alive", "dead", "alive", "dead", "dead", "dead", "dea…
$ monilia         <dbl> 7, 10, 0, 0, 12, 10, 15, 0, 6, 13, 0, 9, 0, 0, 0, 11, …
$ mumps           <dbl> 23, 15, 7, 0, 0, 5, 15, 0, 4, 13, 8, 12, 0, 0, 14, 7, …
$ PPD             <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 22, 17, 0, 0, 15, 5, 0, 0,…
$ PHA             <dbl> 25, 13, 25, 0, 8, 7, 0, 12, 0, 23, 11, 20, 0, 10, 32, …
$ SK_SD           <dbl> 0, 9, 0, 0, 5, 5, 10, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 1…

Extra (long): this data set came from a book that I have as a pdf. The table of data was split over two pages, so I needed to process two pages in the same way. With that in mind, I wrote a function saying which page I wanted to extract and how I wanted to pull out the data (and how many columns to extract it to). Then I ran the function twice, and glued the results together:

library(pdftools)
lee_data <- function(my_url, wanted, n_col) {
  lee <- pdf_text(my_url)
  lee_strings <- strsplit(lee, "\n")
  lee_strings[wanted] %>% 
    unlist() %>%  
    trimws(which = "left") %>% 
    str_split_fixed(" {2,}", n_col) %>% 
    as_tibble() %>% 
    filter(str_detect(V4, "/")) %>% 
    separate_wider_delim(V11, delim = " ", names = c("V11a", "V11b"), too_few = "align_start") %>% 
    mutate(across(V8:V11b, \(x) ifelse(x == "ND", 0, parse_number(x)))) %>% 
    mutate(V11 = V11a,
           V12 = coalesce(V11b, V12)) %>% 
    mutate(V7 = ifelse(V7 == "1", "dead", "alive")) %>% 
    mutate(V2 = as.numeric(V2)) %>% 
    select(
      patient = V1,
      age = V2,
      gender = V3,
      treatment_start = V4,
      response = V5,
      treatment_end = V6,
      status = V7, 
      monilia = V8,
      mumps = V9,
      PPD = V10,
      PHA = V11,
      SK_SD = V12
    )
}
a1 <- lee_data(my_url, 61, 12)
a2 <- lee_data(my_url, 62, 12)
hypernephroma <- bind_rows(a1, a2)
hypernephroma

The function takes the filename of the whole book (which is not really a URL, but that’s what it got called), the page I want to extract data from, and the number of columns of data there are. The function is a rather terrifying pipe, but I built it up one line at a time, returned what I had made so far, and tested it as I went, just as you would. Here’s how it goes:

using pdftools, extract the pdf into a vector of pages.
turn the \n in the vector of pages into genuine newlines, creating a list of pages, with the text of each line on a page as an element of a vector within that page.
grab the page I want
remove any list attributes it has, so that it’s now just a vector of lines
remove any whitespace on the left of each line
break into as many columns as I asked for, splitting on two or more spaces
turn into a dataframe.
there are some header and footer rows. The lines I want have dates in them. I found that choosing the rows with a slash in the treatment start date (V4) was enough; I could have used a regular expression like /7.$, since the years are all in the 1970s, but I didn’t need to be that clever.
Some of the skin test results that should have been in columns V11 and V12 ended up both in column V11, with V12 being empty. (I think they were separated by only one space rather than two.) They were separated in column V11 by a space, so I split them up into new columns V11a and V11b. If there was actually only one thing in V11, it ends up in V11a with a missing in V11b, which is what the too_few thing does.
if the value in any of V8 through V12 is not recorded (recorded as ND for “not done”), arbitrarily replace it by zero so that you wouldn’t have missing values to deal with; otherwise, make sure it’s a number.
put V11 and V12 back as they should be: V11 should be whatever ended up in V11a, and V12 should be whatever is in V11b if it is not missing, and left alone otherwise (there is a value already in V12, so use that). coalesce takes several columns; the first one that is not missing is the answer that is used.
turn the number codes in V7 into what they actually are.
turn age into a number (it seems mysteriously to be text).
finally, the great long select grabs the columns I want and gives them sensible names. I used rename for this before, but I wasn’t sure I wanted all the columns this time, and you can do renaming in select also. (It seems that I did use all the columns in the end.)

After defining and testing the function (as I said, one line at a time), I ran it on what I had previously found to be pages 61 and 62 of the pdf (by searching for the text hypernephroma, which only appears twice in the book), glued those two dataframes together above and below (they have the same column names), and saved the result for you.

I actually used the same ideas to put together the NBA data for you on the previous worksheet (which once had an almost equally long explanation there): it was a PDF on the NBA’s website from which I extracted what I needed. But the PDF had disappeared from there the last I checked, so it would have been rather dopey to talk about processing data that no longer existed, and therefore I didn’t. The book I used here is “Statistical Methods for Survival Data Analysis, 3rd edition” by Elisa T. Lee and John Wenyu Wang, should you care to track it down. And if you want to try the code here, the two pages of the book with the data are at http://datafiles.ritsokiguess.site/hypernephroma.pdf. You’ll probably need pages 1 and 2 of this one (not 61 and 62). I think sharing two pages of the book counts as “fair use”.

Convert the treatment start and treatment end dates into actual dates.

The obvious way is to do two mutates, one for the treatment start date, and the other for the treatment end date. The treatment dates are written as month-day-year with two-digit years (which will get mapped to the right century), thus (I overwrote my dates, but you can create new columns if you prefer):

hypernephroma %>% 
  mutate(treatment_start = mdy(treatment_start)) %>% 
  mutate(treatment_end = mdy(treatment_end))

These are indeed now dates, and display in ISO format with the year first and the right century. Save your dataframe, since we are going to do more with it.

You might be wondering how a two-digit year without century is handled. Functions like mdy use a base R function called strptime² and according to its documentation, it uses a 2018 standard that says that two-digit years from 00 to 68 inclusive are assumed to be in the 2000s, and those from 69 to 99 are assumed to be in the 1900s. The ones here are thus correctly assumed to be in the 1970s.

I actually did it differently: I saw that the two dates both start with treatment, so I can turn them both into dates at once:

hypernephroma %>% 
  mutate(across(starts_with("treatment"), \(tr) mdy(tr))) -> hypernephroma
hypernephroma

In English, that says “for each of the columns whose name starts with treatment, replace it by the mdy version of itself”. (mutate with across by default changes the columns in place rather than creating new columns.)

Work out the number of days between the start and the end of the treatment. Check that your result is indeed a number of days. Turn it, if necessary, into an actual number (with as.numeric), for plots later.

Back near the beginning of dates and times, I said that you could do arithmetic with dates. You can add a number of days to a date and get another date, or you can subtract two dates and get a number of days. Here, subtracting the two dates is what we want:

hypernephroma %>% 
  mutate(surv_time = treatment_end - treatment_start) -> hypernephroma
hypernephroma %>% select(treatment_start, treatment_end, surv_time)

The column I called surv_time is indeed a number of days because it says so. You can also eyeball the two dates and see that the number of days is about right. The first is about six months, the second is just over two months, and the fourth one is about a month more than a year.³

Using a time difference like this works all right for modelling, but plot_predictions later needs an actual number, so we need to make it one:

hypernephroma %>% 
  mutate(surv_time = as.numeric(surv_time)) -> hypernephroma
hypernephroma %>% select(treatment_start, treatment_end, surv_time)

Extra: in this case, we got lucky in that the time difference got counted as a number of days, and that was what we wanted. This was unlike the example in the lecture notes, where it came out as a number of hours. If you want to make sure you control the units, the idea in the lecture notes is what you need: turn it into a period, then divide by the number of seconds in one of your chosen time unit:

hypernephroma %>% 
  mutate(surv_time = as.period(treatment_start %--% treatment_end) / days(1)) %>% 
  select(treatment_start, treatment_end, surv_time)

This time, surv_time is a number. We know it is a number of days because we set it up to be that. Except: our treatment start and end are dates, so surv_time should be a whole number of days. Why isn’t it? This is something I tend to get wrong. According to the lubridate website, A ‘period’ is a timespan defined in civil time (human) units like years, months, and days.” “2 months” is a human timescale, but we saw near the end of the date-time lecture notes that how long “2 months” actually is (in seconds) depends on which two months you are looking at. This is the difference between a period and a duration: only the latter gets the number of seconds right, so maybe we should turn our interval from treatment_start to treatment_end into a duration rather than a period. This also means using the special ddays (for durations) rather than days (which works for periods):

hypernephroma %>% 
  mutate(surv_time = as.duration(treatment_start %--% treatment_end) / ddays(1)) %>% 
  select(treatment_start, treatment_end, surv_time)

and now it works.⁴

Create a suitable response variable y for a Cox proportional-hazards model, and display it. (You don’t need to save it.) Does it distinguish correctly between patients whose treatment_end was their date of death, and the patients who were still alive at this point?

This uses Surv from the survival package (which you remembered to install and load). Surv requires two things: a column with survival times (which can be numbers of something like days or months, or a date difference like ours⁵), together with something that will be TRUE if the event (here death) has happened:

hypernephroma %>% 
  mutate(y = Surv(surv_time, status == "dead")) %>% 
  select(surv_time, status, y)

It is a fairly recent development that it displays properly in a dataframe; the Surv[,2] at the top of the column indicates that it’s actually two columns in one. I think this is now in the lecture notes as well (it is a recent addition).

Some of the values of the response variable have a + next to them; these are the patients that were still alive the last time they were seen. In the jargon of survival analysis, these are “censored” observations. All we know about the first patient is that they lived at least 184 days, but we don’t know how long they lived after that. The second patient, on the other hand, lived for exactly 64 days after the start of treatment, and then they died.

This is more detailed than you need to be: say something about how the values of the response for the patients who were still alive are distinguished from the ones that died (the + sign), and you are good.

Fit a Cox proportional-hazards model, predicting survival time from age, gender, and the five skin test results. Display the summary of the model. (Hint: copy your Surv from above into your modelling function.)

My model has a long name too (you could give yours a shorter name). The way to fit these in the lecture notes is to repeat the Surv code in the coxph:

hypernephroma.1 <- coxph(Surv(surv_time, status == "dead") ~ age + gender + 
                           monilia + mumps + PPD + PHA + SK_SD, data = hypernephroma)
summary(hypernephroma.1)

Call:
coxph(formula = Surv(surv_time, status == "dead") ~ age + gender + 
    monilia + mumps + PPD + PHA + SK_SD, data = hypernephroma)

  n= 33, number of events= 18 

             coef exp(coef)  se(coef)      z Pr(>|z|)  
age      0.046159  1.047241  0.029412  1.569   0.1166  
genderM  0.047637  1.048790  0.664885  0.072   0.9429  
monilia  0.087752  1.091717  0.057014  1.539   0.1238  
mumps   -0.153552  0.857656  0.067896 -2.262   0.0237 *
PPD      0.001477  1.001479  0.044425  0.033   0.9735  
PHA      0.028134  1.028534  0.039535  0.712   0.4767  
SK_SD    0.034313  1.034909  0.074009  0.464   0.6429  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0472     0.9549    0.9886    1.1094
genderM    1.0488     0.9535    0.2849    3.8605
monilia    1.0917     0.9160    0.9763    1.2208
mumps      0.8577     1.1660    0.7508    0.9797
PPD        1.0015     0.9985    0.9180    1.0926
PHA        1.0285     0.9723    0.9518    1.1114
SK_SD      1.0349     0.9663    0.8952    1.1965

Concordance= 0.704  (se = 0.062 )
Likelihood ratio test= 10.83  on 7 df,   p=0.1
Wald test            = 10.13  on 7 df,   p=0.2
Score (logrank) test = 10.98  on 7 df,   p=0.1

This tends to make the model formula long, so you might need to split it over two lines, as I have.

Note that only a few of the explanatory variables appear to be significant. You could also use drop1 here:

drop1(hypernephroma.1, test = "Chisq")

The P-values for summary and drop1 are almost the same but not quite. See the Extra to the next part for some more discussion about this.

Extra: the Cox model properly accounts for which patients are censored and for which ones the event of interest (here death) was observed to happen. The model is fit by maximum likelihood (as for the logistic regression models), but now the likelihood has two parts, one for the patients that died, and one for the patients that were still alive when last observed. Sometimes the censored observations don’t tell us much (eg. a small censored observation, where there is not much information in knowing that a patient lived for “at least 10 days”, because you could have guessed that much). But sometimes a censored observation is informative; knowing that a patient lived for “at least 500 days” is almost as good as knowing exactly how long they lived, because a survival of “at least 500 days” still tells you that survival is good under that patient’s experimental conditions.

(2 points) Use step to remove explanatory variables that do not help to predict survival time. Save and display the model that comes out of step. (Some of the explanatory variables will only be significant at 0.10, not 0.05. Keep those.)

This is actually not difficult at all:

hypernephroma.2 <- step(hypernephroma.1)

Start:  AIC=108.93
Surv(surv_time, status == "dead") ~ age + gender + monilia + 
    mumps + PPD + PHA + SK_SD

          Df    AIC
- PPD      1 106.93
- gender   1 106.93
- SK_SD    1 107.12
- PHA      1 107.43
<none>       108.93
- monilia  1 109.34
- age      1 109.41
- mumps    1 113.40

Step:  AIC=106.93
Surv(surv_time, status == "dead") ~ age + gender + monilia + 
    mumps + PHA + SK_SD

          Df    AIC
- gender   1 104.93
- SK_SD    1 105.12
- PHA      1 105.46
<none>       106.93
- monilia  1 107.40
- age      1 107.48
- mumps    1 111.49

Step:  AIC=104.93
Surv(surv_time, status == "dead") ~ age + monilia + mumps + PHA + 
    SK_SD

          Df    AIC
- SK_SD    1 103.13
- PHA      1 103.48
<none>       104.93
- monilia  1 105.41
- age      1 105.49
- mumps    1 109.57

Step:  AIC=103.13
Surv(surv_time, status == "dead") ~ age + monilia + mumps + PHA

          Df    AIC
- PHA      1 101.62
<none>       103.13
- monilia  1 103.81
- age      1 103.84
- mumps    1 107.58

Step:  AIC=101.62
Surv(surv_time, status == "dead") ~ age + monilia + mumps

          Df    AIC
<none>       101.62
- monilia  1 102.36
- age      1 102.45
- mumps    1 106.58

summary(hypernephroma.2)

Call:
coxph(formula = Surv(surv_time, status == "dead") ~ age + monilia + 
    mumps, data = hypernephroma)

  n= 33, number of events= 18 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.04664   1.04774  0.02808  1.661   0.0968 .
monilia  0.09374   1.09828  0.05625  1.667   0.0956 .
mumps   -0.12328   0.88401  0.05211 -2.366   0.0180 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.048     0.9544    0.9916    1.1070
monilia     1.098     0.9105    0.9836    1.2263
mumps       0.884     1.1312    0.7982    0.9791

Concordance= 0.718  (se = 0.055 )
Likelihood ratio test= 10.14  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 10.21  on 3 df,   p=0.02

or, for the last line,

drop1(hypernephroma.2, test = "Chisq")

Two of the explanatory variables are only significant at 0.10 rather than 0.05. This tends to happen with step. We will keep them.

The step is not too hard to follow:

drop PPD
then drop gender (it is common in these kinds of study for women to have better health outcomes than men, but not here)
then drop SK_SD
then drop PHA
then stop (that is, “remove nothing”).

Age is left, along with two of the skin tests.

Extra: Another thing to notice is that the P-values in the summary output and the drop1 output are not the same, though they will usually be close. The tests in summary are what is called Wald tests: the fitting works out standard errors for the coefficients, and then says “if the sample size is large, the coefficient divided by its standard error will be approximately normal under the null hypothesis” and works out a P-value on that basis. This is like what you would do in ordinary regression, where there are formulas that you learn in B27.⁶ The P-values from drop1 are based on likelihood ratio tests (hence the LRT at the top of the column); this time the test is based on the difference in AIC⁷ between the model with and the model without the term listed. In ordinary regression, these two ways of testing are identical, but in a Cox model they would only be identical if you had an infinite sample size, and for actual real-life sample sizes they will be slightly different (and for small samples they may be noticeably different).

As far as I am concerned, here you can pick either summary or drop1 and go with it. The place where it makes a difference is if you have a categorical variable with more than two categories, in which case only drop1 would give you a test for keeping or dropping the categorical variable as a whole. An example in this context would be if each patient was randomly assigned to one of the skin tests; then you would have a categorical variable skin_test with five levels monilia through SK_SD, and you would have to use drop1 to see whether you could drop it as a whole or needed to keep it. This, though, is not what happened; the skin tests were testing for different things, and the doctor ordered all (or most: see the Extra to the first question) of them for each patient.

Plot predicted survival probabilities over time for five representative ages. Hint: your procedure will use representative values for the other variables, so you do not need to supply values for those or indeed choose the values for age. You might get a warning (that you can ignore), here and on the other similar plot.

This is plot_predictions. The things to keep in mind are that the $x$-axis is time (which was called surv_time in this data set), and that the “strata” whose survival probabilities we are comparing are the five values of age, distinguished by colour, so they go second. (Any other variables make facets, and they go on the end, but there are none here.)

So:

plot_predictions(model = hypernephroma.2, condition = c("surv_time", "age"),
                 type = "survival")

Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.
Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.

The age variable has been “made categorical” for the purposes of this plot (note that it is in second place in the condition, in a spot normally occupied by a categorical variable). But it has to be there, because we have used the $x$-axis spot for the time variable.

Extra: You might be worried about that rather forbidding warning (which appears twice). This is all about the estimation of those coloured envelopes around the predicted survival probabilities. It says that the envelopes may be too small (and not reflect all the uncertainty there is). I am not usually too bothered about the envelopes; I know there is a significant effect of age on survival (at $\alpha = 0.10$), so I could get rid of the envelopes, following one of the hints in the warning:

plot_predictions(model = hypernephroma.2, condition = c("surv_time", "age"),
                 type = "survival", vcov = FALSE)

Ignoring unknown labels:
• fill : "age"

and I think this one is a less cluttered graph.

The reason the warning appears is that marginaleffects tries to do something sensible for a large number of models, not just this model, and sometimes the approach it takes isn’t the best for a particular model that has particular properties.

Describe the effect of increasing age on your plot, and explain briefly how this is consistent with the summary output from your model.

As age increases, the survival curves move down, indicating that an older patient has a lower chance of surviving longer (worse survival).

Looking again at the summary:

summary(hypernephroma.2)

Call:
coxph(formula = Surv(surv_time, status == "dead") ~ age + monilia + 
    mumps, data = hypernephroma)

  n= 33, number of events= 18 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.04664   1.04774  0.02808  1.661   0.0968 .
monilia  0.09374   1.09828  0.05625  1.667   0.0956 .
mumps   -0.12328   0.88401  0.05211 -2.366   0.0180 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.048     0.9544    0.9916    1.1070
monilia     1.098     0.9105    0.9836    1.2263
mumps       0.884     1.1312    0.7982    0.9791

Concordance= 0.718  (se = 0.055 )
Likelihood ratio test= 10.14  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 10.21  on 3 df,   p=0.02

the coefficient for age is positive, meaning that an older patient has a higher hazard of death, which also points to worse survival.

Repeat the previous two questions for mumps skin test values.

Copy, paste, and edit:

plot_predictions(model = hypernephroma.2, condition = c("surv_time", "mumps"),
                 type = "survival")

Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.
Warning: The default delta method standard errors for `coxph` models only take
into account uncertainty in the regression coefficients. Standard errors may be
too small. Use the `inferences()` function or set `vcov` to "rsample", "boot"
or "fwb" to compute confidence intervals by bootstrapping. Set `vcov` to
`FALSE` or `options(marginaleffects_safe=FALSE)` to silence this warning.

or, if you don’t want either the warning or the coloured envelopes:

plot_predictions(model = hypernephroma.2, condition = c("surv_time", "mumps"),
                 type = "survival", vcov = FALSE)

Ignoring unknown labels:
• fill : "mumps"

This time, mumps varies, and representative (mean) values are used for the other variables. Here, a higher score here is associated with a better chance of living longer.

Looking back at the summary, mumps has a negative coefficient, so if mumps is higher, the hazard of death is lower (and thus survival is better).

Footnotes

Not to be confused with glance from the broom package.↩︎
The opposite of strftime, which I talk about in my solution to the Halifax snow problem on assignment 3.↩︎
I don’t know what drtn stands for; I know of this kind of thing as a “difftime”, especially when it’s a time as well as a date. A little research reveals that it’s short for “duration”.↩︎
This is hairy because dates and times are hairy, not because lubridate is trying to make our lives miserable.↩︎
For plotting survival curves, which we usually want to do after fitting a model, the survival times actually need to be numbers.↩︎
Those formulas have estimated standard deviations in them, so the distribution you get P-values from is $t$ rather than normal, but the idea is the same.↩︎
which is itself based on the likelihood.↩︎