STAD29 Assignment 2

Packages

library(tidyverse)
library(MASS, exclude = "select")
library(marginaleffects)

Stop Online Piracy Act

In 2011, the US government introduced a bill to expand the ability of law enforcement to combat online copyright infringement. The ProPublica non-profit news organization compiled a dataset surveying how each legislator was likely to vote on the bill (in the column stance), along with some other information about each legislator. The other columns of interest to us are:

  • party: D (Democrat), R (Republican)
  • money_pro: Money in dollars donated to the legislator’s 2010 campaign by groups in favour of the bill (such as movie and TV studios and record labels)
  • money_con: Money in dollars donated to the legislator’s 2010 campaign by groups opposed to the bill (such as computer and internet companies).
  • years: Years of service in the legislature.

Our aim is to predict a legislator’s stance from the other variables.

The data are in https://datafiles.ritsokiguess.site/piracy.csv, but will need tidying first.

  1. (1 point) Read in and display (some of) the data.

As usual:

my_url <- "https://datafiles.ritsokiguess.site/piracy.csv"
piracy <- read_csv(my_url)
Rows: 534 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, party, state, stance, chamber
dbl (3): money_pro, money_con, years

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
piracy

From a 2026 viewpoint, this interest in copyright seems rather quaint, but groups opposed to a bill like this were a lot less powerful in 2011 than they are now.

  2. (3 points) How many legislators had each stance (at the time of the survey)? What do you notice about the way the stances are listed? Why is this likely to be a problem later? (Hint: look ahead at the kind of model that we will be fitting.)

Three things, one point each:

  • get hold of “how many”, which is just count
  • say how the stances are listed
  • say how this will be a problem later.

First, how many:

piracy %>% count(stance)

Second: the stances are listed in alphabetical order, which happens in this case not to be a logical order.

Third: Alphabetical order is, in itself, what you would expect (the values in stance were read in as text), but looking ahead, it is going to be a problem: we are going to fit an ordered logistic regression, which needs the values of stance to be in a logical order, such as starting from the most negative and ending at the most positive.

This last, we fix up later.

  3. (2 points) Why is this dataset suitable for an ordinal logistic regression? Hint: ignore unknown for the purposes of this question.

Two things:

  • the response variable stance is categorical, with more than two levels: ignoring unknown, there are four levels.
  • those levels come in a natural order, once you ignore unknown: from most negative to most positive (or the other way around).

Say those two things, and something about how you know. It is not helpful to assert that “the levels are ordered” without also saying something about how you know this.
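If you want evidence for your answer rather than an assertion, one way is to list the distinct values of stance for yourself. This is a quick sketch, assuming piracy has been read in as above:

```r
# the distinct stance values: four substantive levels, plus unknown
piracy %>% distinct(stance)
```

The count from the previous question serves the same purpose, since it also lists the distinct values.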

  4. (2 points) Create a new column stance_f that has the values of stance in a sensible order. Hints:
  • Use factor. This has a second optional input levels that lists the levels in the order you want them.
  • We do not want to include the legislators whose stance is unknown, so omit this from your levels.
  • “Leaning no” means that the legislator at the time of the survey intended to vote no, but was open to being persuaded otherwise. This is not as negative as a straight “no”.

Following the hint leads you to something like this:

piracy %>% 
  mutate(stance_f = factor(stance, 
                           levels = c("no", "leaning no", 
                                      "undecided", "yes"))) -> piracy
piracy

Save the result somewhere because you will need it again.

I arranged the levels from most negative to most positive.

You can have the levels in the reverse order. Everything should work, though your output for the predictions later might look different (the columns will be in the opposite order).
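If you do end up with the levels backwards and want to flip them, fct_rev from forcats (loaded with the tidyverse) will do it. A sketch, displaying rather than saving the result:

```r
# reverse the level order of stance_f (display only; not saved back)
piracy %>% mutate(stance_f = fct_rev(stance_f)) %>% count(stance_f)
```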

Extra: in this case, fct_inorder does not work, because the stance values in the original data are not in a sensible order. (The legislators are listed in alphabetical order of last name, which has nothing to do with their opinion on this bill.) Go back and look at your lecture notes to see how the lecture example was different.

Extra extra: the levels of stance_f are now in a sensible order:

piracy %>% count(stance_f)

We are about to get rid of the missing ones.

  5. (1 point) There are some missing values in your dataset. Remove all the legislators that have any missing values in any variables.

This is drop_na with no columns named. You might observe (for yourself) that there are missing values for two reasons: one, that the original data had some missing values, and two, the legislators whose stance was unknown now have missing values in stance_f:

piracy %>% drop_na() -> piracy
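To confirm that the missing values really are gone, you could count the NAs remaining in each column. A sketch, run after the drop_na above; every count should now be zero:

```r
# count the remaining missing values in each column: all should be 0
piracy %>% summarise(across(everything(), ~ sum(is.na(.))))
```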

Extra: to verify what I said, I surreptitiously created a copy of piracy before dropping the missing values:

summary(piracy0)
     name              party              state             money_pro     
 Length:534         Length:534         Length:534         Min.   : -5000  
 Class :character   Class :character   Class :character   1st Qu.:  4500  
 Mode  :character   Mode  :character   Mode  :character   Median : 11700  
                                                          Mean   : 26326  
                                                          3rd Qu.: 27462  
                                                          Max.   :571600  
                                                          NA's   :14      
   money_con          years          stance            chamber         
 Min.   : -1000   Min.   : 1.00   Length:534         Length:534        
 1st Qu.:  4500   1st Qu.: 4.00   Class :character   Class :character  
 Median : 10200   Median :10.00   Mode  :character   Mode  :character  
 Mean   : 23193   Mean   :11.76                                        
 3rd Qu.: 23947   3rd Qu.:18.00                                        
 Max.   :550000   Max.   :58.00                                        
 NA's   :35                                                            
       stance_f  
 no        :122  
 leaning no: 44  
 undecided : 11  
 yes       : 63  
 NA's      :294  
                 
                 

Some of the legislators did not disclose their campaign donations, and a lot of them had a stance that was unknown, which produced a lot of missing values in stance_f.

Compare that with our new version, which is free of missing values anywhere, but has a lot fewer rows:

summary(piracy)
     name              party              state             money_pro     
 Length:227         Length:227         Length:227         Min.   : -5000  
 Class :character   Class :character   Class :character   1st Qu.:  5000  
 Mode  :character   Mode  :character   Mode  :character   Median : 13150  
                                                          Mean   : 36621  
                                                          3rd Qu.: 35850  
                                                          Max.   :571600  
   money_con          years         stance            chamber         
 Min.   : -1000   Min.   : 1.0   Length:227         Length:227        
 1st Qu.:  5075   1st Qu.: 4.0   Class :character   Class :character  
 Median : 11400   Median : 8.0   Mode  :character   Mode  :character  
 Mean   : 27301   Mean   :10.7                                        
 3rd Qu.: 26551   3rd Qu.:16.0                                        
 Max.   :348691   Max.   :48.0                                        
       stance_f  
 no        :115  
 leaning no: 44  
 undecided : 10  
 yes       : 58  
                 
                 
  6. (2 points) Fit a suitable model to predict stance from the other variables of interest. Hint: why did you create stance_f above? (You do not need to display the results.)

Use polr, and make sure to use stance_f as your response (the reason for the hint):

piracy.1 <- polr(stance_f ~ party + money_pro + money_con + years, data = piracy)

  7. (2 points) Demonstrate that one of your explanatory variables can be removed from your model, and fit a second model with this variable removed.

Much the easiest way to do this is with drop1:

drop1(piracy.1, test = "Chisq")

This says that once you know about a legislator’s party and donations, their number of years of service has no effect on their stance towards this bill (P-value 0.808), and this is the largest P-value.

So, years has to be removed. I like update for this, though you can copy, edit, and paste if you prefer:

piracy.2 <- update(piracy.1, . ~ . - years)
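If you would rather not use update, the same second model can be fit directly by copying the polr line and deleting years; this refit is equivalent to the update above:

```r
# equivalent to update(piracy.1, . ~ . - years): refit without years
piracy.2 <- polr(stance_f ~ party + money_pro + money_con, data = piracy)
```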

To get two points, you need to have a good reason to remove years rather than one of the other explanatory variables, and to my mind, drop1 is much the best way to come up with such a reason. If you don’t think of drop1, but you decide to remove years for no apparent reason, the best you can do is one point, if you can demonstrate that removing years is actually reasonable and the grader can follow your logic. One way to demonstrate this might be to show that your second model fits no worse than the first one with anova:

anova(piracy.2, piracy.1, test = "Chisq")

and then to explain that because the two models fit equally well, there is no value in keeping years (or that years does not add to the model over and above what is already there).

Extra: party is still only marginally significant:

drop1(piracy.2, test = "Chisq")

with a P-value of 0.081, but leaving it in makes a later question more interesting.

You need to remove the least significant explanatory variable, namely years, because party is close to significance and its P-value might drop to less than 0.05 when you remove years. If your P-values are in scientific notation like mine, you want the least negative power of 10 with the largest number on the front.1

  8. (4 points) For each combination of party and the two money variables (use the values D and R for party and 5000 and 25000 for the two money variables), obtain predicted probabilities of each stance, and display these predictions in a way that will allow you to assess the effect of each explanatory variable. (You do not need confidence limits for the predicted probabilities.) Hint: datagrid will give you each combination of input values.

Three steps:

First, use datagrid (as per the hint) to obtain the combinations of the explanatory variable values, in new as per tradition:

new <- datagrid(model =  piracy.2, 
                party = c("D", "R"), 
                money_pro = c(5000, 25000), 
                money_con = c(5000, 25000))
new

You see all \(2^3 = 8\) combinations.
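For comparison, the same eight combinations could be built by hand with expand_grid from tidyr, though datagrid is preferable here because it knows which variables the model uses:

```r
library(tidyverse)
# all 2 x 2 x 2 = 8 combinations, built by hand (8 rows)
expand_grid(party = c("D", "R"),
            money_pro = c(5000, 25000),
            money_con = c(5000, 25000))
```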

Second, obtain the predictions, and keep the columns you want:

cbind(predictions(piracy.2, newdata = new)) %>% 
  select(group, estimate, party, money_pro, money_con) 

Re-fitting to get Hessian


Re-fitting to get Hessian
Warning: Unable to extract a variance-covariance matrix from this model.

Remember to keep the column called group, because this is the one that says which response category you are predicting for.

Third, rearrange the display so that all the predictions for one explanatory variable combination are in one row (so that your predictions are laid out the same way new was). You can combine the second and third steps as I have:

cbind(predictions(piracy.2, newdata = new)) %>% 
  select(group, estimate, party, money_pro, money_con) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Re-fitting to get Hessian


Re-fitting to get Hessian
Warning: Unable to extract a variance-covariance matrix from this model.

If you are lucky, you’ll be able to see all four probabilities in each row on your display, which will make comparison easier.
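If the probabilities come out with too many decimals to compare easily, you could round the estimates before pivoting. A sketch, assuming new and piracy.2 as above:

```r
# round the predicted probabilities to 3 decimals for easier comparison
cbind(predictions(piracy.2, newdata = new)) %>% 
  select(group, estimate, party, money_pro, money_con) %>% 
  mutate(estimate = round(estimate, 3)) %>% 
  pivot_wider(names_from = group, values_from = estimate)
```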

  9. (2 points) Describe the effect of a legislator’s political party on their stance towards this bill, all else equal.

Two things:

  • compare the probabilities for the relevant two rows
  • make an overall summary of how the two parties compare in terms of position on the scale, focusing on the largest differences.

Pick two rows where the party differs but the two money variables are the same, such as row 1 and row 5. The predicted probability of no is clearly higher for the Republican, of leaning no is slightly higher for the Democrat, of undecided is about the same (and small in both cases), and of yes is clearly higher for the Democrat.

In summary, a Democrat is more likely to be at the Yes end of the scale, and a Republican is more likely to be at the No end of the scale. (The Democrat-Republican No difference is bigger than the Leaning No difference that goes the other way, so the Republicans are still more No than the Democrats are.)

  10. (3 points) Explain briefly how the two money variables have the effects that you would expect, all else equal.

I would expect that a legislator receiving a larger amount of donations from the pro side (larger money_pro) would be further towards the Yes end, and a legislator receiving more donations from the con side would be further towards the No end.

To assess the effect of money_pro, fix the other two variables and assess the effect on stance of a change in money_pro (for example, compare the first and third rows). When money_pro is bigger, the legislator is more likely to vote Yes and less likely to vote No: they are more likely to be at the Yes end.

The variable money_con goes similarly: for example, compare the seventh and eighth rows where only money_con differs: for the higher value, the probability of No is much higher and of Yes is much lower.

In other words, a legislator will tend to vote in the direction of their donation money, whichever that is.

Another way to assess this (if you can justify your choice of looking at it this way) is to compare a high value on one money variable vs. a low value on the other one (such as the sixth and seventh rows). This gets at the idea of “where a legislator got most of their donations from”. When most of the money is from the pro side, the legislator is more likely to be at the Yes end, and when most of the money is from the con side, the legislator is more likely to be at the No end; once again, the same end of the scale as the donation money.

Extra 1: another way of assessing the effects of the explanatory variables is to do three sets of predictions, changing only one variable at a time. This has the effect of using “average” values for the other variables (the mean for quantitative variables and the most common category for categorical ones).

To assess the effect of party, for example, you supply values only for that one variable and let datagrid supply values for the others:

new <- datagrid(model = piracy.2, party = c("D", "R"))
new

The values for the money variables are their means. To get this, you have to use datagrid; if you try to do it with, say, tibble, you will get only a column party.
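To see what datagrid is doing for you here, a hand-built equivalent (a sketch) is to compute the money means yourself and cross them with the party values:

```r
# party values crossed with the mean of each money variable
# (this is what datagrid supplies automatically)
piracy %>% 
  summarise(money_pro = mean(money_pro), money_con = mean(money_con)) %>% 
  crossing(party = c("D", "R"))
```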

Then predict and reorganize:

cbind(predictions(piracy.2, newdata = new)) %>% 
  select(estimate, group, party, money_pro, money_con) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Re-fitting to get Hessian


Re-fitting to get Hessian
Warning: Unable to extract a variance-covariance matrix from this model.

and you get the same story as before, but you only have to compare these two rows to do it: Republicans are more likely to vote No, all else equal, and Democrats are more likely to vote Yes, so Democrats are more likely to be at the Yes end of the scale.

The two money variables go the same way: supply the values for the variable whose effect you are assessing, and let datagrid supply the others:

new <- datagrid(model = piracy.2, money_pro = c(5000, 25000))
new

and then the prediction code is exactly the same, but with the new new, so the answer is different:

cbind(predictions(piracy.2, newdata = new)) %>% 
  select(estimate, group, party, money_pro, money_con) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Re-fitting to get Hessian


Re-fitting to get Hessian
Warning: Unable to extract a variance-covariance matrix from this model.

The more money_pro, the further towards the Yes end of the scale: more likely to vote yes, less likely to vote no. (The prediction is actually for Republican legislators, but the same pattern would be shown for Democrats, with the whole picture being shifted further towards the Yes end.)

Finally, repeat for the other money variable, where we would expect things to be the other way around:

new <- datagrid(model = piracy.2, money_con = c(5000, 25000))
new

and

cbind(predictions(piracy.2, newdata = new)) %>% 
  select(estimate, group, party, money_pro, money_con) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Re-fitting to get Hessian


Re-fitting to get Hessian
Warning: Unable to extract a variance-covariance matrix from this model.

and so it is: the more money_con, the further towards the No end (predicted probability of voting No is higher and of voting Yes is lower).

Extra 2: I didn’t have you draw a graph of the predictions because there was a lot of data-tidying to do first. The only difficulty here is that you have three explanatory variables, and you have to decide how to arrange them, as well as the group outcome category. I’m going to put money_pro on the \(x\)-axis, group as colour, and party and money_con as facets (which means that plot_predictions will choose some representative values for money_con). I’m not claiming that this is the only choice, but that is mine:

plot_predictions(model = piracy.2, 
                 condition = c("money_pro", "group", "party", "money_con"))

Re-fitting to get Hessian


Re-fitting to get Hessian
Warning: Unable to extract a variance-covariance matrix from this model.
Ignoring unknown labels:
• fill : "group"

The money_con values are increasing down the page:

  • the party differences (left vs right) are hard to see, but I think the Democrats are more likely to be Yes (purple) and the Republicans are more likely to be No (red) for any given value of money_pro (\(x\)-axis) and money_con (downward facet). (The I in the middle is an independent: there is exactly one of these.)

  • for money_pro: all the facets are showing Yes to be more likely if this is higher and No to be more likely if it is lower.

  • for money_con: as you read down, Yes moves further to the right, and No becomes more likely for a larger range of values of money_pro.

This, I think, is about what you would expect once you disentangle it.

Points: \(1+3+2+2+1+2+2+4+2+3 = 22\)

Footnotes

  1. The technical term for the 8.079 part is “mantissa”, with the power of 10 being called the “exponent”.↩︎