Bootstrapping is another resampling technique, closely related to cross-validation. Instead of systematically leaving out one observation at a time, a bootstrap resamples from the existing data with replacement. A key characteristic of a bootstrap sample is that it contains the same number of observations as the original data set.
To look more closely at sampling with replacement, consider a simple vector consisting of the integers from 1 to 10.
example <- 1:10
A bootstrap sample of example contains 10 elements drawn at random from example with replacement, meaning any element may be chosen multiple times and some elements may not be chosen at all.
sample(example,
       size = 10,
       replace = TRUE)
## [1] 10 6 10 2 6 2 3 2 1 1
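To see both effects at once, the following minimal sketch counts how often each element was drawn and lists the elements that were never drawn. The seed and the names `boot`, `counts`, and `left_out` are illustrative assumptions rather than part of the lesson, and your own draws will differ.

```r
# Illustrative sketch: inspect one bootstrap sample of the vector 1:10.
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

example <- 1:10
boot <- sample(example, size = length(example), replace = TRUE)

# Count how many times each original element was drawn (0 = never drawn).
counts <- table(factor(boot, levels = example))

# Elements of `example` that do not appear in the bootstrap sample.
left_out <- setdiff(example, boot)

counts
left_out
```

Elements with a count above 1 were drawn repeatedly, and every element in `left_out` has a count of 0.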
When sampling from a data frame, we can sample the row indices with replacement:
Example <- data.frame(id = seq_len(10),
                      some_data = rnorm(10))
sample(seq_len(nrow(Example)),
       size = nrow(Example),
       replace = TRUE)
## [1] 2 7 2 2 8 5 7 4 7 3
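The sampled indices become a bootstrap data set once they are used to subset the rows. The sketch below shows one way this might look. The names `boot_rows` and `BootExample` and the seed are assumptions made for illustration.

```r
# Illustrative sketch: turn sampled row indices into a bootstrap data frame.
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

Example <- data.frame(id = seq_len(10),
                      some_data = rnorm(10))

# Sample row positions with replacement.
boot_rows <- sample(seq_len(nrow(Example)),
                    size = nrow(Example),
                    replace = TRUE)

# Subset by position; a repeated index yields a repeated row.
BootExample <- Example[boot_rows, ]

nrow(BootExample)               # same number of rows as Example
length(unique(BootExample$id))  # number of distinct original rows drawn
```

Because the subsetting is by position, a row that was drawn three times appears three times in `BootExample`.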
Using samples of size 10, 25, 50, 75, 100, 250, 500, and 1000, create a visualization that shows what percentage of a data set is sampled at each sample size. Use 100 bootstrap samples per sample size.
Since this challenge focuses on performing the repetitions, we’ll give you an easy-to-use function that returns the percentage of a data set used within a bootstrap sample.
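# n = sample size
# Returns the proportion (between 0 and 1) of distinct rows that appear
# in one bootstrap sample of a toy data set with n rows.
percentage_used_bootstrap <- function(n){
  ToyData <- data.frame(id = seq_len(n),
                        random_data = rnorm(n))
  # Draw n row indices with replacement
  sampled_id <- sample(seq_len(nrow(ToyData)),
                       size = n,
                       replace = TRUE)
  # Distinct rows drawn, as a share of all rows
  length(unique(sampled_id)) / nrow(ToyData)
}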
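As a sanity check for your simulated values, the chance that a given row appears at least once in a bootstrap sample of size n is 1 - (1 - 1/n)^n. The sketch below evaluates that expression at the sample sizes from the challenge. The helper name `expected_fraction` is hypothetical and not part of the lesson.

```r
# Illustrative sketch: theoretical counterpart of the simulated proportion.
# `expected_fraction` is a hypothetical helper name, not part of the lesson.
expected_fraction <- function(n) {
  # P(a given row is drawn at least once in n draws with replacement)
  1 - (1 - 1 / n)^n
}

sizes <- c(10, 25, 50, 75, 100, 250, 500, 1000)
round(expected_fraction(sizes), 3)

# Limiting value as n grows
1 - exp(-1)
```

Your simulated proportions should scatter around these values, with less spread at larger sample sizes.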
Another hint: R includes a function that repeats an expression as many times as you want. See the help file for the replicate function (?replicate). An example of its usage is:
replicate(n = 3,
          {
            rnorm(10)
          })
##             [,1]        [,2]        [,3]
##  [1,] -0.7387652 -0.64492242 -0.21318294
##  [2,]  0.2923057  0.29165659 -0.87917170
##  [3,]  0.2832526  1.42065941 -1.09536157
##  [4,] -0.1117889  0.73044565  0.54498322
##  [5,]  0.7945495 -0.57246376  1.34687739
##  [6,] -1.0255168 -1.24607933  0.33403972
##  [7,] -0.9433817  0.86416396  0.39113925
##  [8,] -1.4758657 -0.07183034 -0.88818877
##  [9,] -0.5960892 -1.48023946 -0.04375358
## [10,] -0.0216543 -1.47327958 -1.04872258
Notice that the result is a matrix in which each column holds one set of 10 randomly generated values. If your expression returns a single value, however, the result is a vector instead of a matrix.
replicate(n = 3,
          {
            rnorm(1)
          })
## [1] -1.5516777 0.9846539 2.2874544
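The vector form is convenient when each repetition boils down to one summary number. As a minimal sketch, the following repeats a small computation 100 times and checks the shape of the result. The choice of `mean(rnorm(10))`, the seed, and the name `sample_means` are assumptions made only for illustration.

```r
# Illustrative sketch: replicate() with an expression that returns one number.
set.seed(123)  # arbitrary seed, only so the sketch is repeatable

# Each repetition returns a single mean, so the result is a plain vector.
sample_means <- replicate(n = 100,
                          {
                            mean(rnorm(10))
                          })

length(sample_means)  # one value per repetition
dim(sample_means)     # NULL, because the result is not a matrix
summary(sample_means)
```

A vector like this can be passed directly to summary or plotting functions, which is the shape you will want for each sample size in the challenge.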