Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ε of length n = 100.
rm(list=ls())
set.seed(1)
X <- rnorm(100)
error <- rnorm(100)
Generate a response vector Y of length n = 100 according to the model Y = β0 + β1·X + β2·X² + β3·X³ + ε, where β0, β1, β2, and β3 are 4, 9, 2, and −1 respectively.
Y <- 4 + 9*X + 2*X^2 - 1*X^3 + error
Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X², …, X¹⁰. What is the best model obtained according to Cp, BIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
require(leaps)
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 4.0.4
dataframe <- data.frame(Y, X)
fit <- regsubsets(Y ~ poly(X, 10), data = dataframe, nvmax = 10)
fit_summary <- summary(fit)
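Note that poly(X, 10) builds orthogonal polynomial terms, so the fitted coefficients are not on the same scale as the simulated betas. If coefficients directly comparable to 4, 9, 2, and −1 are wanted, one option (a sketch, not run here; the object name fit_raw is illustrative) is to fit raw polynomials instead:
fit_raw <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dataframe, nvmax = 10)  # raw (non-orthogonal) terms
coef(fit_raw, 3)  # coefficients on the original X, X^2, X^3 scale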
require(tidyverse); require(ggplot2); require(ggthemes)
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.0.4
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.4
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.4
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: ggthemes
## Warning: package 'ggthemes' was built under R version 4.0.5
tibble(Cp = fit_summary$cp,
       BIC = fit_summary$bic,
       AdjR2 = fit_summary$adjr2) %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = 'value_type', values_to = 'value') %>%
  ggplot(aes(id, value, col = value_type)) +
  geom_line() + geom_point() + ylab('') + xlab('Number of Variables Used') +
  facet_wrap(~ value_type, scales = 'free') + scale_x_continuous(breaks = 1:10)
Three predictors appears to be the best model size: adjusted R² is highest, and Cp and BIC are lowest, at three variables.
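The exercise also asks for the coefficients of the best model. A minimal check, reusing fit and fit_summary from above (not run here), confirms the size chosen by each criterion and reports the three-variable coefficients:
which.min(fit_summary$cp)     # model size minimising Cp
which.min(fit_summary$bic)    # model size minimising BIC
which.max(fit_summary$adjr2)  # model size maximising adjusted R^2
coef(fit, 3)                  # coefficients of the three-variable model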
Repeat (3), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (3)?
Testing backward stepwise selection
require(caret)
## Loading required package: caret
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
model_back <- train(Y ~ poly(X, 10), data = dataframe,
method = 'glmStepAIC', direction = 'backward',
trace = 0,
trControl = trainControl(method = 'none', verboseIter = FALSE))
summary(model_back$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8914 -0.5860 -0.1516 0.5892 2.1794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.35264 0.09557 66.472 <2e-16 ***
## `poly(X, 10)1` 61.56415 0.95569 64.418 <2e-16 ***
## `poly(X, 10)2` 18.13980 0.95569 18.981 <2e-16 ***
## `poly(X, 10)3` -14.70908 0.95569 -15.391 <2e-16 ***
## `poly(X, 10)5` 1.48019 0.95569 1.549 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9133516)
##
## Null deviance: 4424.513 on 99 degrees of freedom
## Residual deviance: 86.768 on 95 degrees of freedom
## AIC: 281.59
##
## Number of Fisher Scoring iterations: 2
Testing forward stepwise selection
x_poly <- poly(dataframe$X, 10)
colnames(x_poly) <- paste0('poly', 1:10)
model_forw <- train(y = Y, x = x_poly,
method = 'glmStepAIC', direction = 'forward',
trace = 0,
trControl = trainControl(method = 'none', verboseIter = FALSE))
summary(model_forw$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8914 -0.5860 -0.1516 0.5892 2.1794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.35264 0.09557 66.472 <2e-16 ***
## poly1 61.56415 0.95569 64.418 <2e-16 ***
## poly2 18.13980 0.95569 18.981 <2e-16 ***
## poly3 -14.70908 0.95569 -15.391 <2e-16 ***
## poly5 1.48019 0.95569 1.549 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9133516)
##
## Null deviance: 4424.513 on 99 degrees of freedom
## Residual deviance: 86.768 on 95 degrees of freedom
## AIC: 281.59
##
## Number of Fisher Scoring iterations: 2
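Both stepwise directions arrive at the same model: the first three (orthogonal) polynomial terms are strongly significant, while the fifth term kept by AIC is not (p ≈ 0.125), so the answer agrees with the three-variable model chosen by best subset selection in (3). As a cross-check, forward and backward stepwise selection can also be run directly in leaps by passing method = 'forward' or method = 'backward' to regsubsets(); a sketch reusing dataframe from above (object names fit_fwd and fit_bwd are illustrative):
fit_fwd <- regsubsets(Y ~ poly(X, 10), data = dataframe, nvmax = 10, method = 'forward')
fit_bwd <- regsubsets(Y ~ poly(X, 10), data = dataframe, nvmax = 10, method = 'backward')
which.min(summary(fit_fwd)$bic)  # model size chosen by BIC under forward search
which.min(summary(fit_bwd)$bic)  # model size chosen by BIC under backward search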