Implement a jackknife procedure to produce a graphical display (such as a histogram or boxplot) of the \(R^2\) value for the model estimating weight
as a function of Time
.
See The Challege Description for details.
First, let’s make an object to store the results of our jackknife. There are a number of ways we can do this. I will use the simplest here, and then demonstrate some other approaches at the end.
For the simplest approach, we can create a vector that has as many elements as the number of rows in the data set.
n <- nrow(ChickWeight) # number of rows in data.
jack_knife_r_squared <- vector(mode = "numeric",
length = n)
Next, we write a for
loop that will omit one observation from the data on each build of the model.
for (i in seq_along(jack_knife_r_squared)){
temp_fit <- lm(formula = weight ~ Time,
data = ChickWeight[-i, ])
jack_knife_r_squared[i] <- summary(temp_fit)$adj.r.squared
}
With the results in the vector, we may use the hist
command to produce a quick histogram.
hist(jack_knife_r_squared)
seq_along
When first learning to write R code, it can be tempting to write for
loops with the form
for (i in 1:length(vector_of_interest)){
...
}
This can occasionally lead to frustrating errors if, for some reason, vector_of_interest
has length 0. This might happen if there were no valid results in a previous step of the analysis. When this occurs, the loop tries to iterate over i = c(1, 0)
and causes an error.
Initiating loops using seq_along
avoids this problem because seq_along
will construct the approriate iteration sequence to skip the loop if vector_of_interest
has length zero. Compare the differences below.
vector_of_interest <- vector("numeric", length = 0)
1:length(vector_of_interest)
## [1] 1 0
seq_along(vector_of_interest)
## integer(0)
It may not seem like much of an issue now, but debugging these kinds of errors can be time consuming and frustrating.
We may also approach the problem by putting the results directly into the data frame.
ChickWeight$adj_r_sq_without_this_row <- rep(NA, length = nrow(ChickWeight))
for (i in seq_len(nrow(ChickWeight))){
temp_fit <- lm(formula = weight ~ Time,
data = ChickWeight[-i, ])
ChickWeight$adj_r_sq_without_this_row[i] <-
summary(temp_fit)$adj.r.squared
}
This sets us up to use either the hist
command or to use ggplot
plotting tools. Differences in these plots are related to differences in how the two functions define the break points under their respective defaults.
hist(ChickWeight$adj_r_sq_without_this_row)
library(ggplot2)
ggplot(data = ChickWeight,
mapping = aes(x = adj_r_sq_without_this_row)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
apply
functionsR features a family of functions that will apply a set of commands across a set of values. The functions perform all of the details of a for
loop for you, while avoiding some of the traps that can trip you up when writing your own for
loops. This requires having a function that will perform just one iteration of your loop.
jack_r_square <- function(row_to_omit){
fit <- lm(formula = weight ~ Time,
data = ChickWeight[-row_to_omit, ])
summary(fit)$adj.r.squared
}
result <- sapply(X = seq_len(nrow(ChickWeight)),
FUN = jack_r_square)
hist(result)