STAD29 Assignment 8

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Packages

library(tidyverse)

Amino acid sequencing

A geneticist examined the protein molecule Cytochrome-C for a number of vertebrates. At each position along the molecule, one of a small number of amino acids is present. The geneticist then worked out a “genetic distance” between each pair of vertebrate species by counting the number of locations at which the two vertebrate species had a different amino acid. (This is the same idea as our measure of distance between languages in lecture.) The results are in http://datafiles.ritsokiguess.site/amino_distances.csv. The idea is that more closely related species will tend to have a smaller genetic distance between them.

  1. (1 point) Read in and display (some of) the genetic distances.

No different from usual:

my_url <- "http://datafiles.ritsokiguess.site/amino_distances.csv"
amino_distances <- read_csv(my_url)
Rows: 17 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): species
dbl (17): Bullfrog, Chicken, Dog, Dogfish, Donkey, Duck, Horse, Human, Kanga...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
amino_distances

You see that this is a wide-format array containing the genetic distance between the species in the row and the species in the column (and that the genetic distance between a species and itself is zero).

Extra: there is a lengthy story of how I got the data to you in this form, but the strategy I used is the same as the one for working out the distances between languages in lecture: counting the number of somethings that are different.

Here’s where I began:

data(amino.acid.sequence.1972, package = "cluster.datasets")
amino.acid.sequence.1972

These are the sixteen positions on the molecule where the seventeen species don’t all have the same amino acid (the other positions don’t tell us anything about how the vertebrate species are related). You see that the species are playing the role of the languages in our lecture example, and the positions are playing the role of number words (or, more accurately, the first letters of the number words). Different positions don’t have the same collection of different amino acids, but our comparison will be within positions, so that is not a problem.

The first thing I noticed is that these data, from 1972, are from an era where humans were by default male humans (look at the top row). That definitely needs fixing, and doing so is a chance to show off something from the new dplyr:

amino.acid.sequence.1972 %>% 
  mutate(species = replace_values(
    species, 
    "Man" ~ "Human"
  )) -> amino_wide
amino_wide

In words, this is “take our original dataframe, and redefine species to have Human in it everywhere the old value is Man.” The new function replace_values is designed for making a few small changes in a dataframe, and the syntax is a column, and then a collection of changes in the form old ~ new. Everything else is left as is.
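
If your version of dplyr does not have replace_values yet, the long-standing if_else makes the same one-value change; here is a sketch on a hypothetical three-row dataframe (the species values are invented for illustration):

```r
library(dplyr)

# Hypothetical three-row stand-in for the real dataframe:
df <- tibble(species = c("Man", "Monkey", "Horse"))
df %>%
  mutate(species = if_else(species == "Man", "Human", species))
# species is now Human, Monkey, Horse
```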

The next thing I did in the lecture example is to make things long:

amino_wide %>% 
  pivot_longer(-species, 
               names_to = "position", 
               values_to = "amino_acid") -> amino_long

amino_long

Now, counting the number of locations at which the amino acids differ is not simple, so as in lecture I am going to abstract that out into a function. The function below is a direct copy of the countdiff from the lecture notes, but with a few names changed to reflect what we are doing here. Compare the lecture one with this one:

countdiff <- function(sp1, sp2, d) {
  d %>% filter(species == sp1) -> sp1d
  d %>% filter(species == sp2) -> sp2d
  sp1d %>%
    left_join(sp2d, join_by(position)) %>%
    count(different = (amino_acid.x != amino_acid.y)) %>%
    filter(different) %>% pull(n) -> ans
  # if ans has length zero, set answer to (integer) zero.
  ifelse(length(ans)==0, 0L, ans) 
}

The inputs to the function are two species names and a dataframe (our long dataframe). The function makes dataframes for each species, then matches them by position, counts the number of amino acids that are different, and saves that (as a number). As in the lecture example, something funny happens if none of the amino acids are different (as, for example, in comparing a species with itself), so that is handled specially at the end.

I didn’t change anything much apart from some names, but let’s make sure it works:

countdiff("Human", "Monkey", amino_long)
[1] 1
countdiff("Human", "Human", amino_long)
[1] 0

If you look at the data, humans and monkeys only differ at position p.34, so those look right. I suspect my function is rather inefficient (as we’ll see later), but it seems to work.
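
Part of the inefficiency is doing a join for every pair of species. When the positions are already aligned, as they are across rows of the wide dataframe, counting differences is just a vectorized comparison; a sketch on made-up sequences (the position names p1 to p3 are invented):

```r
# Two aligned amino-acid sequences (made-up values; position names invented):
sp1 <- c(p1 = "G", p2 = "A", p3 = "S")
sp2 <- c(p1 = "G", p2 = "V", p3 = "S")
sum(sp1 != sp2)  # number of positions that differ: 1
```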

Now I need to get all possible pairs of species, and to do that, I need all the species, which are most easily obtained from the wide data:

species <- amino_wide$species
species
 [1] "Human"       "Monkey"      "Horse"       "Donkey"      "Pig"        
 [6] "Dog"         "Rabbit"      "Whale"       "Kangaroo"    "Chicken"    
[11] "Pigeon"      "Duck"        "Turtle"      "Rattlesnake" "Bullfrog"   
[16] "Tuna"        "Dogfish"    

To make all possible pairs, as in lecture, I use crossing (which is like datagrid run with more than one variable):

pairs <- crossing(species = species, sp2 = species) 
pairs

I used a sort of disposable name for the second species, for reasons you will see shortly.

Next, for each of those pairs of species, run countdiff and save the results:

pairs %>% 
  rowwise() %>% 
  mutate(diff = countdiff(species, sp2, amino_long)) -> thediff
thediff

If you try this yourself, you’ll see that it actually takes a noticeable amount of time to run, undoubtedly because my function is not as efficient as it might have been.

We need an actual square table, not this long thing, which means pivoting sp2 wider (now you see why it had a name I didn’t really care about):

thediff %>% pivot_wider(names_from=sp2, 
                        values_from=diff) -> amino_distances
amino_distances

and that’s what I saved for you.

  2. (2 points) Create a suitable object to be used as input for hierarchical clustering.

This is going to be input for hclust (looking ahead to the next question), so it needs to be a dist object. We already have distances (dissimilarities), so as.dist is called for:

amino_distances %>% 
  select(-species) %>% 
  as.dist() -> d

as.dist requires only columns of numbers, so take off the species column first. You could also do it like this:

amino_distances %>% 
  select(-species) -> x
d <- as.dist(x)

where “amino_distances without the species column” can have a disposable name because you’re only using it here.

You might also choose a better name than d for your dist object.

Extra: you should for yourself (but not to hand in) take a look at d:

d
            Bullfrog Chicken Dog Dogfish Donkey Duck Horse Human Kangaroo
Chicken           13                                                     
Dog               13       9                                             
Dogfish           22      19  16                                         
Donkey            15      10   4      16                                 
Duck              13       3   7      17      9                          
Horse             16      11   5      17      1   10                     
Human             16      12   9      23     10   10    11               
Kangaroo          14      11   5      19      7    9     6     8         
Monkey            16      13  10      22     11   11    12     1        9
Pig               13       9   2      16      2    8     3     9        5
Pigeon            14       4   8      19     10    3    11    11       10
Rabbit            13       8   4      17      5    6     6     8        5
Rattlesnake       23      16  17      24     18   14    19    14       17
Tuna              17      17  17      20     18   17    19    20       17
Turtle            12       8   8      19     10    7    11    14       10
Whale             13       9   2      16      4    7     5     9        5
            Monkey Pig Pigeon Rabbit Rattlesnake Tuna Turtle
Chicken                                                     
Dog                                                         
Dogfish                                                     
Donkey                                                      
Duck                                                        
Horse                                                       
Human                                                       
Kangaroo                                                    
Monkey                                                      
Pig             10                                          
Pigeon          12   9                                      
Rabbit           9   4      7                               
Rattlesnake     15  17     15     15                        
Tuna            21  17     18     17          24            
Turtle          15   9      8      9          19   18       
Whale           10   2      8      2          16   17      8

These are the dissimilarities arranged in a triangle, broken up into pieces so that all of it is displayed within the width of your screen. The grader, however, does not want to scroll through a hundred copies of that.

Extra extra: internally, d is actually one-dimensional. The display you just saw is called its “print method”; the person who designed dist objects wrote a special function that displays a dist object nicely to look at, but that has nothing to do with how it’s actually stored:1

str(d)
 'dist' int [1:136] 13 13 22 15 13 16 16 14 16 13 ...
 - attr(*, "Labels")= chr [1:17] "Bullfrog" "Chicken" "Dog" "Dogfish" ...
 - attr(*, "Size")= int 17
 - attr(*, "call")= language as.dist.default(m = x)
 - attr(*, "Diag")= logi FALSE
 - attr(*, "Upper")= logi FALSE

If you compare this with the display of d, the numbers shown here are the dissimilarities from the first column, reading down (then the ones from the second column, and so on, until the 8 right at the end). There are 136 dissimilarities because each of the 17 vertebrate species can be paired with 16 others, but each pair is counted only once, e.g. human and monkey is the same pair as monkey and human:

17 * 16 / 2
[1] 136
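
Even though d is stored as one vector, as.matrix turns a dist back into the familiar square form (and as.dist reverses that), which is handy for looking up a single dissimilarity. A small sketch with made-up distances:

```r
# A symmetric matrix of made-up distances between three items:
m <- matrix(c(0, 2, 5,
              2, 0, 3,
              5, 3, 0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
d_small <- as.dist(m)         # stores only the 3 below-diagonal values
as.matrix(d_small)["A", "C"]  # back to square form to look one up: 5
```
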
  3. (3 points) Obtain a hierarchical cluster analysis using Ward’s method, and display the dendrogram.

Use hclust with the right thing in method:

d.1 <- hclust(d, method = "ward.D")

and, because hclust goes back to the early days of R, drawing a dendrogram uses the base R plot rather than ggplot:

plot(d.1)

Extra: there are ways to use ggplot to plot these, but they live in other packages, and I didn’t want to complicate the issue any more than necessary. To my mind, these other packages don’t really offer much improvement over the base plot, for example:

library(dendextend)

---------------------
Welcome to dendextend version 1.17.1
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags: 
     https://stackoverflow.com/questions/tagged/dendextend

    To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------

Attaching package: 'dendextend'
The following object is masked from 'package:stats':

    cutree
d.1 %>% 
  as.dendrogram() %>% 
  plot()

This plot(), confusingly, comes from dendextend, and is not the same as the base R plot we used before.

  4. (2 points) Which are the first two species to be combined into a cluster? You will need to obtain some more output to get the unique answer to this question. Describe how you found your answer.

Look at the clustering history:

d.1$merge
      [,1] [,2]
 [1,]   -5   -7
 [2,]   -8  -10
 [3,]   -3  -11
 [4,]  -17    3
 [5,]   -2   -6
 [6,]  -12    5
 [7,]  -13    4
 [8,]   -9    7
 [9,]    1    8
[10,]  -16    6
[11,]   -1  -15
[12,]  -14    2
[13,]   -4   11
[14,]    9   10
[15,]   12   14
[16,]   13   15

This shows which species (negative numbers) are combined into clusters, and (later) which clusters (positive numbers) are combined together. In this case, species numbers 5 and 7 are the first to be combined into a cluster. Which species are these? Look at the labels:

d.1$labels
 [1] "Bullfrog"    "Chicken"     "Dog"         "Dogfish"     "Donkey"     
 [6] "Duck"        "Horse"       "Human"       "Kangaroo"    "Monkey"     
[11] "Pig"         "Pigeon"      "Rabbit"      "Rattlesnake" "Tuna"       
[16] "Turtle"      "Whale"      

Species 5 is donkey and 7 is horse.

You can check your answer by seeing that these two species are combined together at the bottom of the dendrogram, and you can check your procedure by seeing that human and monkey should be next. Human is species #8 and monkey is #10, and those are indeed on the second line of the merge matrix.
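
You can also pull the first merge out programmatically: negative entries in merge index the labels directly, so negating the first row turns them into positive label indices. A self-contained sketch on a made-up three-species distance matrix:

```r
# Made-up distances: A and B are closest, so they should merge first
m <- matrix(c(0, 1, 4,
              1, 0, 5,
              4, 5, 0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
hc <- hclust(as.dist(m), method = "ward.D")
first <- hc$merge[1, ]  # negative entries are original observations
hc$labels[-first]       # negate to index into the labels: "A" "B"
```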

  5. (2 points) Use your output from the previous question to find out which individual species is the last to be added to a cluster. Describe how you found your answer.

Read the merge table up from the bottom. The bottom three rows show clusters being joined together (for example, in row 14, the clusters formed at steps 9 and 10 are combined). But in row 13, species 4 is combined with the cluster formed at step 11. Looking at the labels shows that species 4 is Dogfish.

You could also have determined this from the dendrogram; this is the species whose name is printed the highest up the page. But I wanted you to show that you knew how to use the merge table to answer the question. (There is, of course, nothing stopping you from using the dendrogram to check your answer once you have it, or even from saying to yourself “well, the answer should be Dogfish” and then seeing how you could get it from the merge output.)

  6. (3 points) Redraw your dendrogram with 4 clusters shown on it. Repeat with 7 clusters.

This is a base R plot, and the way you add things to a base R plot is consecutive lines within a code chunk. So start by re-drawing your dendrogram, and then on the next line put the appropriate rect.hclust. This is 4 clusters:

plot(d.1)
rect.hclust(d.1, 4)

and this is 7:

plot(d.1)
rect.hclust(d.1, 7)

  7. (2 points) Based on what you know or can find out about these vertebrate species, do you prefer the 4-cluster or the 7-cluster solution? Explain briefly.

The individuals (species, here) within a cluster should be similar, and individuals within different clusters should be different (in some sense).

Looking at the 4-cluster solution, you would expect humans and monkeys to be in the same cluster (they have a lot of similarities as animals), but you would not expect rattlesnakes to be with them in the same cluster at all. The other clusters you might be able to live with: “live in water” for my first cluster, “land animals” for the third one,2 and “descended from dinosaurs” for the fourth one.

Because this is hierarchical clustering, the 7-cluster solution is obtained from the 4-cluster one by splitting some of the clusters. In this case, what happened is that rattlesnakes got split off from humans and monkeys (a plus), and the dogfish-bullfrog-tuna cluster got split up, putting those species in clusters of their own. As I write this, I’m not sure biologically whether this makes sense (you can argue that it does), but judging by the height on the dendrogram at which they split, they are probably about as dissimilar from each other as rattlesnakes are from monkeys and humans.

So I prefer the seven-cluster solution, because the four-cluster solution combines things that seem to be different. Make an argument along these lines with at least one example supporting your case. I phrased the question as I did because I do not want you to base your call on the dendrogram only; I want you to have a biological reason for choosing one number of clusters over the other for these data.

Extra: I went to Wikipedia3 to find out more about those species on the left (in my dendrogram):

These three don’t seem to have much in common apart from living in water, so I would support putting them in different clusters (another reason to prefer seven clusters rather than four).

Additional note: I was vaguely surprised to discover that fish are vertebrates, and then I thought about what happens when you eat a (whole, cooked) fish: you have to remove the bone, which is the backbone. I went hunting for some examples of invertebrates (no backbone) and found this list; most of the examples there are insects or crustaceans; the name of the latter means “has a shell”, and most of them live in water (but are not fish, even though they are often called “shellfish”).

Exoplanets

Astronomers have discovered planets outside the Solar System, so-called “exoplanets”. Exoplanets are not well understood, but it is possible to measure some properties of the exoplanets:

  • mass: the mass of the planet relative to the mass of Jupiter
  • period: how long the planet takes to complete one orbit of its star, measured in Earth days
  • eccen: how close the orbit of the planet is to a circle (eccentricity zero); planets otherwise have orbits in the shape of an ellipse. A long skinny ellipse shape will have a large eccentricity; an ellipse that is close to a circle will have a small one.

The exoplanets have numerical labels, in column id.

It was found that the logs of mass and period, and the square root of eccen, had close to normal distributions, so we will use those variables (that is, log_mass, log_period, and sqrt_eccen) in our analysis. The data are in http://datafiles.ritsokiguess.site/exoplanets.csv.

  8. (1 point) Read in and display (some of) the data.
my_url <- "http://datafiles.ritsokiguess.site/exoplanets.csv"
planets <- read_csv(my_url)
Rows: 101 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): id, mass, period, eccen, log_mass, log_period, sqrt_eccen

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
planets

There are 101 exoplanets here. Each one has an id, along with its three original measurements of mass, period, and eccentricity, and the transformed values that we will use. Note for later that the transformed variables all have an underscore in their names.

Extra: there is a data story here as well (but not as long as for the other dataset). These are the planets dataset from the HSAUR2 package, which only contains the original variables:

planets0 <- HSAUR2::planets
planets0

You immediately see that some of the period values (especially) are much bigger than the others, and so things may behave better if we transform our variables first.

I realized that you can find a good transformation for a single variable the same way you do for the response variable in a model: use Box-Cox. To do it this way, you use a “model” that has just an intercept, like this for mass:5

library(MASS, exclude = "select")
boxcox(mass ~ 1, data = planets0)

There is no problem with using a log transformation here, and no other good choices. (There are 101 observations, so we can estimate the transformation pretty well.)

Let’s try period:

boxcox(period ~ 1, data = planets0)

This is not as convincing: 0 is just outside the interval, but there are not really any better choices (unless you like fourth root, power 0.25), so I’m going to go with log here as well.

The other variable eccen needs a bit more thought. Some of the orbits, to the accuracy that the astronomers could measure them, were exactly circular, so the eccen value is exactly zero, and you can’t use Box-Cox with zero values because you might end up taking log of zero:

boxcox(eccen ~ 1, data = planets0)
Error in boxcox.default(eccen ~ 1, data = planets0): response variable must be positive

The standard way around this is to add a “little something” to each value before transforming. What “little something”? I decided to find the smallest five6 non-zero values of eccen:7

planets0 %>% filter_out(eccen == 0) %>% 
  slice_min(eccen, n = 5)

and then I thought “what happens if you add half the smallest value, that is 0.005?”

boxcox((eccen + 0.005) ~ 1, data = planets0)

and that suggests square root (power 0.5). This is an arbitrary choice of a value to add, so I also tried 0.001 and 0.01, and those came out with almost the same picture, so I figured that exactly what you added didn’t actually matter much. Hence, for you, I used log of the first two variables, and square root of (eccen plus 0.005) and added an id column which I put first:

planets0 %>% 
  mutate(log_mass = log(mass),
         log_period = log(period),
         sqrt_eccen = sqrt(eccen + 0.005)) %>% 
  mutate(id = row_number()) %>% 
  select(id, everything()) -> planets00
planets00

and that was the dataset I saved for you.

The new variables are not obviously skewed:

planets00 %>% 
  pivot_longer(contains("_"), names_to = "var_name", values_to = "var_value") %>% 
  ggplot(aes(x = 1, y = var_value)) + geom_boxplot() +
  facet_wrap(~ var_name, scales = "free")

or you could do this:

planets00 %>% 
  pivot_longer(contains("_"), names_to = "var_name", values_to = "var_value") %>% 
  ggplot(aes(sample = var_value)) + stat_qq() + stat_qq_line() + 
  facet_wrap(~ var_name, scales = "free")

(The pivot_longer is to make plots of all three transformed variables at once.) Two of the variables are actually short-tailed, or possibly bimodal, but we don’t need perfect normality for what we’ll be doing.

  9. (3 points) Create and save a new dataframe that contains the \(z\)-score values of only the three variables of interest. Hint: what do the names of your variables of interest have in common?

So, select the columns of interest first. They all have an underscore in their names (the hint), and scale finds \(z\)-scores, so:

planets %>% 
  select(contains("_")) %>% 
  mutate(across(everything(), \(x) scale(x))) -> planets_scaled
planets_scaled

Alternatively, do the scale first, by putting the contains inside the across, and then select the columns you want at the end, but then you have contains twice. Get hold of a dataframe that looks like this one, somehow. (Check that your resulting values look like \(z\)-scores.)
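
The scale-first version might look like the sketch below, shown on a made-up stand-in dataframe so that it runs on its own (as.numeric drops the matrix structure that scale returns); the summarize at the end checks that the columns really are \(z\)-scores, with mean 0 and SD 1:

```r
library(dplyr)

# Made-up stand-in for planets: an id column plus two "_" columns
toy <- tibble(id = 1:4,
              log_mass = c(1, 2, 3, 4),
              sqrt_eccen = c(0.1, 0.2, 0.2, 0.5))
toy %>%
  mutate(across(contains("_"), \(x) as.numeric(scale(x)))) %>%
  select(contains("_")) -> toy_scaled

# check that these really are z-scores: mean 0 and SD 1 in each column
toy_scaled %>%
  summarize(across(everything(), list(m = mean, s = sd)))
```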

Extra: taking \(z\)-scores does nothing to make a distribution more normal. It just scales it to have a common mean (0) and SD (1). If we had done this without doing the transformations first, the distributions would still have been right-skewed.

  10. (2 points) Run a K-means cluster analysis with 5 clusters, and display the output. Hint: how do you make sure an output based on randomness comes out the same when you re-run it?

Remember that K-means is a random algorithm so run it several times (using nstart) and take the best result. Remember to use the dataframe with the \(z\)-scores. The hint is suggesting that you use set.seed. I’m using my standard seed (use yours):

set.seed(457299)
planets.5 <- kmeans(planets_scaled, 5, nstart = 20)
planets.5
K-means clustering with 5 clusters of sizes 22, 25, 8, 20, 26

Cluster means:
    log_mass log_period sqrt_eccen
1  0.1408163  0.6152652 -0.6258151
2 -0.2309590  0.3392567  0.7000629
3 -1.4670349 -0.6216399  0.1820100
4 -0.8171765 -1.5976594 -1.3522923
5  1.1829163  0.5734252  0.8406202

Clustering vector:
  [1] 4 4 3 3 4 4 4 3 4 4 4 3 4 3 4 4 1 3 1 2 2 3 2 4 4 4 1 2 2 2 2 2 4 2 2 2 1
 [38] 1 2 2 4 1 2 1 2 2 2 2 2 1 1 1 4 2 2 3 2 1 1 1 1 1 1 2 2 1 1 2 5 4 5 5 4 1
 [75] 5 4 5 5 5 5 5 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

Within cluster sum of squares by cluster:
[1] 17.733793 11.580049  9.198428 20.168507 22.618257
 (between_SS / total_SS =  72.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Your output will probably look a bit different from mine (because randomness). You will have five clusters, and they should be the same sizes as mine, but they won’t (probably) be in the same order.

Extra: we have, at this point, no idea whether 5 clusters is a good number. To choose a better number of clusters, we will have to make a scree plot (coming up later).

  11. (2 points) Which of your clusters has the fewest exoplanets in it? From your output, what seems to be distinctive about this cluster, in terms of the original variables?

This is my cluster 3, with 8 exoplanets in it. Look at the cluster means, and bear in mind that we are working with \(z\)-scores now, so that any cluster mean far away from zero is interesting.

Hence, the exoplanets in cluster 3 are distinctive in having small mass. (Optionally, you can add that they have smaller-than-average period, though not as small as cluster 4, and eccentricity close to the mean, but this last one is not “distinctive”.)

  12. (3 points) List the IDs of the exoplanets in the cluster you found in the previous question.

Get the IDs (in my planets) and the cluster numbers (in my planets.5) together into one dataframe, for example as below, and then grab the ones in the numbered cluster you found above:

tibble(id = planets$id, cluster = planets.5$cluster) %>% 
  filter(cluster == 3)

Since we know which cluster these are in, you could also do this to just list the IDs:

tibble(id = planets$id, cluster = planets.5$cluster) %>% 
  filter(cluster == 3) %>% 
  pull(id)
[1]  3  4  8 12 14 18 22 56

although since there are only 8 exoplanets in this cluster, the grader can see them all without scrolling.

  13. (4 points) Obtain a scree plot for 2 through 20 clusters. What seems to be a good number of clusters to use, different from the 5 clusters you used before? Hint: to work out the total within-cluster sum of squares, you can use the function in the lecture notes.

The function in question is this one. Copy-paste it first:

ss <- function(i, d) {
  d %>%
    select(where(is.numeric)) %>%
    kmeans(i, nstart = 20) -> km
  km$tot.withinss
}

This takes as input a number of clusters and a dataframe to run kmeans on (that is to say, our standardized one planets_scaled).

Next, make a dataframe with the numbers of clusters you want on your scree plot, and (rowwise) work out the total within-cluster sum of squares for each:

tibble(n_cluster = 2:20) %>% 
  rowwise() %>% 
  mutate(ssq = ss(n_cluster, planets_scaled)) -> w
w

No idea why I called my dataframe w. Anyway, next plot the values that I called ssq against the values I called n_cluster, joined by lines:

ggplot(w, aes(x = n_cluster, y = ssq)) + geom_point() + geom_line()

Finally, look for an elbow and say why you picked that one. Your graph may not look exactly like mine (randomness), but you should try to find a downward-pointing elbow that is (i) reasonably far “down the mountain” and (ii) has a number of clusters that is not too large (realizing that (i) and (ii) are in conflict, so you will have to compromise). I see elbows at 6 and 10 clusters; I think it is reasonable to choose either one of these. With 101 observations, you might even be able to justify more than 10 clusters. I’m going to go with 10 clusters.

The grader will look at whether you appear to have chosen a sensible number of clusters from your scree plot, which might differ from mine.

  14. (2 points) Run K-means with your chosen number of clusters, and display the output.

This is an exact copy of what you did with 5 clusters before. Copy, paste, and edit:

planets.10 <- kmeans(planets_scaled, 10, nstart = 20)
planets.10
K-means clustering with 10 clusters of sizes 16, 10, 1, 6, 10, 18, 12, 12, 10, 6

Cluster means:
      log_mass log_period  sqrt_eccen
1   0.18258945  0.4091499  0.06188516
2  -0.05106812  0.7682215 -1.10255320
3  -3.11088680  0.8585409 -0.30146692
4   1.44591431 -0.4612437  0.42811825
5   0.87781199  0.5482803  1.36506976
6  -0.37233703  0.4202892  0.82777388
7  -1.35743424 -1.8015250 -1.53773694
8   1.24197896  1.0766597  0.43551370
9  -0.06746311 -1.1313599 -0.98409883
10 -1.33188468 -0.8926195  0.38086112

Clustering vector:
  [1]  7  7 10 10  7  7  7 10  7  7  7 10  7 10  7  7  2 10  2  6  6  9  6  7  9
 [26]  9  2  6  6  6  6  6  9  6  1  6  2  2  6  6  9  2  6  1  1  6  6  6  6  1
 [51]  1  9  9  6  1  3  1  1  2  2  1  1  1  1  1  1  1  5  5  9  1  5  9  8  5
 [76]  9  8  8  5  5  5  2  2  4  5  8  4  8  4  8  8  5  8  8  5  4  8  8  4  8
[101]  4

Within cluster sum of squares by cluster:
 [1] 4.304753 6.430403 0.000000 3.669618 4.300802 6.994318 4.703544 4.279042
 [9] 6.160742 2.065460
 (between_SS / total_SS =  85.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Extra: I will discover (in a later Extra) that most of the exoplanets that were in my cluster 3 before are in cluster 10 here. This time, cluster 10 also has low mass on average, but now cluster 7 also has low mass and low period, and the new cluster 3 (that only has one observation in it) has low mass as well. The suggestion is that the one observation that’s in cluster 3 here got split off from cluster 3 before (we explore that later).

  15. (4 points) For one of the exoplanets you found in Question 12 (it does not matter which one), find out which cluster it is in in your solution of Question 14. Then find out which other exoplanets are in that same cluster in Question 14. Does the cluster membership appear to be similar in Questions 12 and 14? Explain briefly.

Several things to do. Let’s remind ourselves first of which exoplanets were in our small cluster before:

tibble(id = planets$id, cluster = planets.5$cluster) %>% 
  filter(cluster == 3)

All right, so I’ll pick exoplanet 22. Which cluster is that in now? To find out, make a dataframe just like above, but this time getting the cluster from the analysis you just did (that I called planets.10):

tibble(id = planets$id, cluster = planets.10$cluster) -> clusters.10

and then search for exoplanet 22:

clusters.10 %>% filter(id == 22)

It’s in my cluster 9 now. What else is in cluster 9?

clusters.10 %>% filter(cluster == 9)

For me, exoplanet 22 has none of its former cluster-mates from cluster 3 with it in its new cluster 9.

For the grader: anything that appears to be the right process is fine here:

  • Pick an exoplanet that was in the small cluster before (any one)
  • Find out which cluster it’s in now
  • Find the other exoplanets that are in that same cluster.
  • Comment on the similarity or difference between the two clusters.

Extra: to get a sense of what correspondence there is, if any, between the 5-cluster solution before and the 10-cluster solution now, make a dataframe that has the cluster membership of both solutions, then count up what’s where:

tibble(id = planets$id, 
       cluster5 = planets.5$cluster, 
       cluster10 = planets.10$cluster) -> compare
compare

and then

with(compare, table(cluster5, cluster10))
        cluster10
cluster5  1  2  3  4  5  6  7  8  9 10
       1  9 10  0  1  0  0  0  1  1  0
       2  6  0  0  0  1 18  0  0  0  0
       3  0  0  1  0  0  0  0  0  1  6
       4  0  0  0  0  0  0 12  0  8  0
       5  1  0  0  5  9  0  0 11  0  0

Most of my old cluster 3 has gone to my new cluster 10. I happened to pick one of the exoplanets that didn’t.

Because this is not hierarchical clustering, you don’t get to a 10-cluster solution by taking a 5-cluster solution and splitting it; K-means starts afresh for each number of clusters. What does tend to happen is that some of the individuals in the same cluster will remain in the same cluster, and some of them will split off. Here:

  • the old cluster 1 split into the new clusters 1 and 2
  • most of the old cluster 2 is the new cluster 6, but some went to the new cluster 1
  • most of the (small) old cluster 3 went to the new cluster 10
  • the old cluster 4 split into the new clusters 7 and 9
  • the old cluster 5 split into the new clusters 4, 5, and 8.

There are only a few frequencies of 1, each an exoplanet splitting off from the cluster it was in before, and only one of the new clusters (cluster 1) has more than one member from two different old clusters. So it was in fact mostly the case that the old clusters split to make the new ones, with a few exceptions.
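
The “where did most of each old cluster go” reading can also be extracted from a dataframe like compare with count and slice_max; a sketch on made-up memberships so that it stands alone:

```r
library(dplyr)

# Made-up memberships: old cluster 1 mostly went to new cluster 4, etc.
cmp <- tibble(cluster5  = c(1, 1, 1, 2, 2, 3),
              cluster10 = c(4, 4, 5, 6, 6, 7))
cmp %>%
  count(cluster5, cluster10) %>%
  group_by(cluster5) %>%
  slice_max(n, n = 1) %>%   # the new cluster that received most of each old one
  ungroup()
```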

Some of the new clusters are small: for example, the new “cluster” 3 has only one exoplanet in it!

We have the machinery to find the six of the exoplanets we found before that went to the new cluster 10:

compare %>% 
  filter(cluster5 == 3, cluster10 == 10)

Six of the eight exoplanets we found before, but not number 22 or number 56. These were the ones that got split off:

compare %>% 
  filter(id == 22 | id == 56)

Points: \((1+2+3+2+2+3+2) + (1+3+2+2+3+4+2+4) = 15 + 21 = 36\).

Footnotes

  1. str looks at the “structure” of an object.↩︎

  2. Or possibly “animals that have Chinese years named after them”!↩︎

  3. This is usually a reliable source to start finding out about anything. If you want more detail, there are almost always further references. One of the ways to assess whether a source is reliable is to “go sideways”: look at other information on the same site, such as things you already know about, and assess that for accuracy and bias. If the things you know about are treated accurately, the things you don’t are likely to be as well.↩︎

  4. Or dogs.↩︎

  5. Of course, you realize that MASS is a package that just happens to have the same name as one of our variables.↩︎

  6. There are seven rows displayed here because the output includes all the values tied for fifth-smallest as well.↩︎

  7. This is a nice use of the new filter_out, but you can use filter with eccen != 0 if you prefer.↩︎