Bootstrapping

Bootstrapping is another resampling technique that can be used for validation, much like cross-validation. Instead of systematically leaving out one observation at a time, a bootstrap resamples from the existing data with replacement. A key characteristic of a bootstrap sample is that it contains the same number of observations as the original data set.

To look at sampling with replacement more closely, consider a simple vector consisting of the integers from 1 to 10.

example <- 1:10

A bootstrap sample of example contains 10 elements drawn at random from example with replacement. Because each draw is made from the full vector, any element may be chosen multiple times, and some elements will usually not be chosen at all.

sample(example, 
       size = 10, 
       replace = TRUE)
##  [1] 10  6 10  2  6  2  3  2  1  1
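To see the repeats and the omissions directly, the sample can be tabulated and compared against the original vector. This is only a minimal sketch: the seed value and the name boot are arbitrary choices for illustration, and your values will differ if you change the seed.

```r
set.seed(42)  # arbitrary seed, used only so the sketch is repeatable

example <- 1:10
boot <- sample(example, size = length(example), replace = TRUE)

# How many times each selected value appears in the bootstrap sample
table(boot)

# Values of example that were never selected
setdiff(example, boot)

# Proportion of the original values that appear at least once
length(unique(boot)) / length(example)
```

Every value of example lands in exactly one of two groups: it either appears in the table or it is returned by setdiff().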

When sampling from a data frame, we sample from the row indices rather than from the values themselves:

Example <- data.frame(id = seq_len(10), 
                      some_data = rnorm(10))

sample(seq_len(nrow(Example)), 
       size = nrow(Example),
       replace = TRUE)
##  [1] 2 7 2 2 8 5 7 4 7 3
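The sampled indices are not yet a bootstrap sample of the data frame; they still need to be used to select rows. One way this might look is sketched below, where the seed and the names row_index and BootExample are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, used only so the sketch is repeatable

Example <- data.frame(id = seq_len(10),
                      some_data = rnorm(10))

# Draw row positions with replacement
row_index <- sample(seq_len(nrow(Example)),
                    size = nrow(Example),
                    replace = TRUE)

# Use the sampled positions to build the bootstrap data frame;
# a row drawn more than once is repeated in the result
BootExample <- Example[row_index, ]

nrow(BootExample)                 # same number of rows as Example
length(unique(BootExample$id))    # number of distinct original rows used
```

When a row is selected more than once, R adds a suffix to its row name (for example 2.1) so that row names stay unique, while the id column keeps the original value.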

Challenge

Using samples of size 10, 25, 50, 75, 100, 250, 500, and 1000, create a visualization that shows what percentage of a data set is sampled at each sample size. Use 100 bootstrap samples per sample size.

Some helpful hints

Since this challenge is focused on performing the repetitions, we’ll give you an easy-to-use function that returns the proportion of a data set that appears in a single bootstrap sample.

# n = sample size
percentage_used_bootstrap <- function(n){
  ToyData <- data.frame(id = seq_len(n), 
                        random_data = rnorm(n))
  sampled_id <- sample(seq_len(nrow(ToyData)), 
                       size = n, 
                       replace = TRUE)
  length(unique(sampled_id)) / nrow(ToyData)
}

Another hint is that R includes a function that will repeat an expression as many times as you want. See the help file with ?replicate. An example of its usage is:

replicate(n = 3, 
          {
            rnorm(10)
          })
##             [,1]        [,2]        [,3]
##  [1,] -0.7387652 -0.64492242 -0.21318294
##  [2,]  0.2923057  0.29165659 -0.87917170
##  [3,]  0.2832526  1.42065941 -1.09536157
##  [4,] -0.1117889  0.73044565  0.54498322
##  [5,]  0.7945495 -0.57246376  1.34687739
##  [6,] -1.0255168 -1.24607933  0.33403972
##  [7,] -0.9433817  0.86416396  0.39113925
##  [8,] -1.4758657 -0.07183034 -0.88818877
##  [9,] -0.5960892 -1.48023946 -0.04375358
## [10,] -0.0216543 -1.47327958 -1.04872258

Notice that the result is a matrix where each column is one set of 10 randomly generated values. If your expression returns a single value instead, the result is a vector rather than a matrix.

replicate(n = 3, 
          {
            rnorm(1)
          })
## [1] -1.5516777  0.9846539  2.2874544
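The single-value form is convenient when each repetition draws a bootstrap sample and reduces it to one summary number. The sketch below, which is illustrative and not the challenge solution, collects the mean of 100 bootstrap samples of the vector 1:10; the seed and the name boot_means are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, used only so the sketch is repeatable

example <- 1:10

# Each repetition draws one bootstrap sample and returns a single value,
# so replicate() collects the results into a plain numeric vector
boot_means <- replicate(n = 100,
                        {
                          mean(sample(example,
                                      size = length(example),
                                      replace = TRUE))
                        })

length(boot_means)     # one value per repetition
is.matrix(boot_means)  # FALSE, because each repetition returned one value
summary(boot_means)
```

The same pattern applies to any expression that returns one number per repetition.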