Challenge

Implement a jackknife procedure to produce a graphical display (such as a histogram or boxplot) of the \(R^2\) value for the model estimating weight as a function of Time.

See The Challege Description for details.

Solution

First, let’s make an object to store the results of our jackknife. There are a number of ways we can do this. I will use the simplest here, and then demonstrate some other approaches at the end.

For the simplest approach, we can create a vector that has as many elements as the number of rows in the data set.

n <- nrow(ChickWeight) # number of rows in data.
jack_knife_r_squared <- vector(mode = "numeric", 
                               length = n)

Next, we write a for loop that will omit one observation from the data on each build of the model.

for (i in seq_along(jack_knife_r_squared)){
  temp_fit <- lm(formula = weight ~ Time, 
                 data = ChickWeight[-i, ])
  jack_knife_r_squared[i] <- summary(temp_fit)$adj.r.squared
}

With the results in the vector, we may use the hist command to produce a quick histogram.

hist(jack_knife_r_squared)

A note about seq_along

When first learning to write R code, it can be tempting to write for loops with the form

for (i in 1:length(vector_of_interest)){
  ...
}

This can occasionally lead to frustrating errors if, for some reason, vector_of_interest has length 0. This might happen if there were no valid results in a previous step of the analysis. When this occurs, the loop tries to iterate over i = c(1, 0) and causes an error.

Initiating loops using seq_along avoids this problem because seq_along will construct the approriate iteration sequence to skip the loop if vector_of_interest has length zero. Compare the differences below.

vector_of_interest <- vector("numeric", length = 0)

1:length(vector_of_interest)
## [1] 1 0
seq_along(vector_of_interest)
## integer(0)

It may not seem like much of an issue now, but debugging these kinds of errors can be time consuming and frustrating.

Alternative Solutions

Results back to the data

We may also approach the problem by putting the results directly into the data frame.

ChickWeight$adj_r_sq_without_this_row <- rep(NA, length = nrow(ChickWeight))

for (i in seq_len(nrow(ChickWeight))){
  temp_fit <- lm(formula = weight ~ Time, 
                 data = ChickWeight[-i, ])
  ChickWeight$adj_r_sq_without_this_row[i] <- 
    summary(temp_fit)$adj.r.squared
}

This sets us up to use either the hist command or to use ggplot plotting tools. Differences in these plots are related to differences in how the two functions define the break points under their respective defaults.

hist(ChickWeight$adj_r_sq_without_this_row)

library(ggplot2)

ggplot(data = ChickWeight, 
       mapping = aes(x = adj_r_sq_without_this_row)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

apply functions

R features a family of functions that will apply a set of commands across a set of values. The functions perform all of the details of a for loop for you, while avoiding some of the traps that can trip you up when writing your own for loops. This requires having a function that will perform just one iteration of your loop.

jack_r_square <- function(row_to_omit){
  fit <- lm(formula = weight ~ Time, 
                 data = ChickWeight[-row_to_omit, ])
  summary(fit)$adj.r.squared
}

result <- sapply(X = seq_len(nrow(ChickWeight)), 
                 FUN = jack_r_square)

hist(result)