This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt
Follow all directions. Type complete sentences to answer all questions inside the answer tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R code to report numeric answers inside the answer tags (i.e., do not hard code your numeric answers).
The article by Johnson (1996) defines body fat determined with the Siri and Brozek methods, as well as fat free weight, using equations (1), (2), and (3), respectively.
\[\text{bodyfatSiri} = \frac{457}{\text{density}} - 414.2 \tag{1}\]
\[\text{bodyfatBrozek} = \frac{495}{\text{density}} - 450 \tag{2}\]
\[\text{FatFreeWeight} = \left(1 - \frac{\text{brozek}}{100}\right) \times \text{weight\_lbs} \tag{3}\]
Body Mass Index (BMI) is defined as
\[\text{BMI} = \frac{\text{kg}}{\text{m}^2}\]
Please use the following conversion factors with this project: 0.453592 kilos per pound and 2.54 centimeters per inch.
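As a small illustration of the conversion factors, the sketch below computes BMI from pounds and inches and fat free weight as the non-fat fraction of body weight. The helper names `bmi_from_imperial()` and `fat_free_weight()` are hypothetical and are not part of the assignment; the inputs are values that appear in the data listings of this document.

```r
# Conversion factors stated in the project
KG_PER_LB <- 0.453592
CM_PER_IN <- 2.54

# Hypothetical helper: BMI (kg / m^2) from weight in pounds and height in inches
bmi_from_imperial <- function(weight_lbs, height_in) {
  kg <- weight_lbs * KG_PER_LB
  m <- height_in * CM_PER_IN / 100
  kg / m^2
}

# Hypothetical helper: fat free weight as (1 - brozek / 100) * weight
fat_free_weight <- function(brozek, weight_lbs) {
  (1 - brozek / 100) * weight_lbs
}

bmi_example <- bmi_from_imperial(weight_lbs = 154.25, height_in = 67.75)
ffw_example <- fat_free_weight(brozek = 6.4, weight_lbs = 148.50)

round(bmi_example, 1)
round(ffw_example, 1)
```

Rounded to one decimal place, these agree with the recorded values for the same measurements (a BMI of 23.6 and a fat free weight of 139.0).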
Use the original data from and evaluate the quality of the data. Specifically, start by using the fread() function from the data.table package written by Dowle and Srinivasan (2019) to read the data from into an object named bodyfat. Pass the following vector of names to the col.names argument of fread():
c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight",
"neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm",
"forearm_cm", "wrist_cm")
# Type your code and comments inside the code chunk
# Obtaining the original data
library(data.table)  # provides fread()
names <- c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm", "forearm_cm", "wrist_cm")
bodyfat <- fread("", col.names = names)
Create plotly interactive scatterplots of brozek versus density with case mapped to color, weight_lbs versus height_in with case mapped to color, and ankle_cm versus weight_lbs with case mapped to color to help identify potential outliers. How many values do you think are potentially data entry errors? Explain your reasoning and show the code you used to identify the errors.
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of brozek versus density
library(ggplot2)
library(plotly)
p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
                                color = case)) +
  geom_point()
g <- ggplotly(p)
Figure 1: Plot of brozek versus density
bodyfat[c(48, 76, 96, 42, 182, 216),
c("density", "brozek", "siri", "height_in", "weight_lbs")]
density brozek siri height_in weight_lbs
1: 1.0665 6.4 5.6 71.25 148.50
2: 1.0666 18.3 18.5 67.50 148.25
3: 1.0991 17.3 17.4 77.75 224.50
4: 1.0250 31.7 32.9 29.50 205.00
5: 1.1089 0.0 0.0 68.00 118.50
6: 0.9950 45.1 47.5 64.00 219.00
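library(dplyr)
# Brozek body fat implied by the reported density, and its gap from the recorded value
bodyfat <- bodyfat %>% mutate(brozek_eq = (495 / density) - 450)
bodyfat <- bodyfat %>% mutate(broDiff = brozek - brozek_eq)
# Cases whose recorded and computed values differ by more than 2
bodyfat %>%
  filter(abs(broDiff) > 2)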
case brozek siri density age weight_lbs height_in bmi fat_free_weight
1 48 6.4 5.6 1.0665 39 148.50 71.25 20.6 139.0
2 76 18.3 18.5 1.0666 61 148.25 67.50 22.9 121.1
3 96 17.3 17.4 1.0991 53 224.50 77.75 26.1 185.7
4 182 0.0 0.0 1.1089 40 118.50 68.00 18.1 118.5
5 216 45.1 47.5 0.9950 51 219.00 64.00 37.6 120.2
neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm ankle_cm biceps_cm
1 34.6 89.8 79.5 92.7 52.7 37.5 21.9 28.8
2 36.0 91.6 81.8 94.8 54.5 37.0 21.4 29.3
3 41.1 113.2 99.2 107.5 61.7 42.3 23.2 32.9
4 33.8 79.3 69.4 85.0 47.2 33.5 20.2 27.7
5 41.2 119.8 122.1 112.8 62.5 36.9 23.6 34.7
forearm_cm wrist_cm brozek_eq broDiff
1 26.8 17.9 14.1350211 -7.735021
2 27.0 18.3 14.0915057 4.208494
3 30.8 20.4 0.3684833 16.931517
4 24.6 16.5 -3.6116873 3.611687
5 29.1 18.4 47.4874372 -2.387437
The potential data entry errors are the cases that fall off the curve in Figure 1, since brozek is a deterministic function of density (linear in 1/density, and nearly linear in density over the observed range). These outliers include cases 96, 76, 48, and 182. Case 216 is most likely a data entry error as well. We reached these conclusions by creating a new variable, brozek_eq, that computes what brozek should be from the Brozek equation, and then finding the cases where the absolute difference between brozek and brozek_eq is greater than 2. This is shown in our code above.
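The same check can be sketched in base R on just the six rows printed above, without the full data file. The data frame `suspects` is a hypothetical name; its values are copied from the listing of cases 48, 76, 96, 42, 182, and 216, and the threshold of 2 is the one used in the report.

```r
# Rows copied from the printed listing of the six cases of interest
suspects <- data.frame(
  case    = c(48, 76, 96, 42, 182, 216),
  density = c(1.0665, 1.0666, 1.0991, 1.0250, 1.1089, 0.9950),
  brozek  = c(6.4, 18.3, 17.3, 31.7, 0.0, 45.1)
)

# Value implied by the reported density, using the same formula as the report
suspects$brozek_eq <- 495 / suspects$density - 450
suspects$broDiff <- suspects$brozek - suspects$brozek_eq

# Cases where the recorded and implied values differ by more than 2
flagged <- suspects$case[abs(suspects$broDiff) > 2]
flagged
```

Case 42 is not flagged by this rule because its density and brozek values agree; it stands out only through its height.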
plot_ly(data = bodyfat, x = ~brozek, y = ~density,
marker = list(size = 5,
color = ~case,
line = list(color = ~case,
width = 1)))
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of weight_lbs versus height_in
p <- ggplot(data = bodyfat, aes(x = weight_lbs, y = height_in,
                                color = case)) +
  geom_point()
g <- ggplotly(p)
Figure 2: Plot of weight_lbs versus height_in
There are a few outliers in this graph; however, there seems to be only one data entry error, case 42. A height of 29.5 inches with a weight of 205 lbs is practically impossible. The remaining points are possible combinations of height and weight. We expect this relationship to show some variability because neither variable is computed directly from the other.
plot_ly(data = bodyfat, x = ~weight_lbs, y = ~height_in,
marker = list(size = 5,
color = ~case,
line = list(color = ~case,
width = 1)))
# Type your code and comments inside the code chunk
# Isolating points of interest
# Points of interest for brozek vs. density graph
bodyfat[c(48, 76, 96, 42, 182, 216),
c("density", "brozek", "siri", "height_in", "weight_lbs" )]
density brozek siri height_in weight_lbs
48 1.0665 6.4 5.6 71.25 148.50
76 1.0666 18.3 18.5 67.50 148.25
96 1.0991 17.3 17.4 77.75 224.50
42 1.0250 31.7 32.9 29.50 205.00
182 1.1089 0.0 0.0 68.00 118.50
216 0.9950 45.1 47.5 64.00 219.00
# Points of interest for height_in vs. weight_lbs graph
bodyfat[42,
c("density", "brozek", "siri", "height_in", "weight_lbs" )]
density brozek siri height_in weight_lbs
42 1.025 31.7 32.9 29.5 205
# Type your code and comments inside the code chunk
# Replacing identified typos of density and height_in
p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
                                color = case)) +
  geom_point()
g <- ggplotly(p)
# Updating computed bodyfat values and bmi measurements
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of ankle_cm versus weight_lbs
p <- ggplot(data = bodyfat, aes(x = ankle_cm, y = weight_lbs,
                                color = case)) +
  geom_point()
g <- ggplotly(p)
Figure 3: Interactive scatterplot of ankle_cm versus weight_lbs
It looks like cases 31 and 86 could be data entry errors. Case 39 (at the top of the plot) is probably not an entry error, just an outlier. Cases 31 and 86 show abnormally large ankle circumferences for people of average weight. The entries should probably be 23.9 (case 31) and 23.7 (case 86).
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of ankle_cm versus weight_lbs
plot_ly(data = bodyfat, x = ~ankle_cm, y = ~weight_lbs,
marker = list(size = 5,
color = ~case,
line = list(color = ~case,
width = 1)))
Figure 4: Interactive scatterplot of ankle_cm versus weight_lbs
# Type your code and comments inside the code chunk
# Replacing identified typos in ankle_cm
bodyfat$ankle_cm[31] <- 23.9
bodyfat$ankle_cm[86] <- 23.7
p <- ggplot(data = bodyfat, aes(x = ankle_cm, y = weight_lbs,
                                color = case)) +
  geom_point()
g <- ggplotly(p)
# Type your code and comments inside the code chunk
# Identifying bodyfat typos for brozek and siri
p <- ggplot(data = bodyfat, aes(x = brozek, y = siri,
                                color = case)) +
  geom_point()
g <- ggplotly(p)
# Type your code and comments inside the code chunk
# Number of rounding discrepancies for siri
bodyfat<- bodyfat %>%
mutate(siri_eq = round((457/density - 414.2),1))
sum(bodyfat$siri != bodyfat$siri_eq)
[1] 242
# Number of rounding discrepancies for brozek
sum(bodyfat$brozek != bodyfat$brozek_eq)
[1] 252
# Number of rounding discrepancies for bmi
height_m <- (bodyfat$height_in * 2.54) / 100
bodyfat<- bodyfat %>%
mutate(bmi_eq = round(((weight_lbs * 0.453592) / ((height_m) ^ 2)),1))
sum(bodyfat$bmi != bodyfat$bmi_eq)
[1] 99
Case 182 is flagged because a body fat of 0 is not possible. Case 169 is a possible typo because it does not fall on the line.
Both of the possible typos mentioned above are most likely rounding errors. The value for case 182 was most likely very small and ended up being rounded to 0.
Make the clean data accessible to R
Load the file bodyfatClean.csv from into your R session. Specifically, use the read.csv() function to load the file bodyfatClean.csv into your current R session, naming the object cleaned_bf. Since GitHub stores the file as html, click on the raw button to obtain a *.csv file.
# Type your code and comments inside the code chunk
# Read in clean data
cleaned_bf <- read.csv("")
Use the glimpse() function from the dplyr package written by Wickham et al. (2019) to view the structure of cleaned_bf.
# Type your code and comments inside the code chunk
# Examining the object cleaned_bf
glimpse(cleaned_bf)
Observations: 251
Variables: 18
$ age <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32…
$ weight_lbs <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 18…
$ height_in <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 7…
$ neck_cm <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38…
$ chest_cm <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6,…
$ abdomen_cm <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 8…
$ hip_cm <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1…
$ thigh_cm <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62…
$ knee_cm <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38…
$ ankle_cm <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23…
$ biceps_cm <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35…
$ forearm_cm <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31…
$ wrist_cm <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18…
$ brozek_C <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1…
$ bmi_C <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24…
$ age_sq <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, …
$ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64…
$ am <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, …
Partition the data.
Use the createDataPartition() function from the caret package to partition the data into training and testing. Use 80% of the data for training and 20% for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable you want to use is brozek_C (the computed brozek based on the reported density).
# Type your code and comments inside the code chunk
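# Partitioning the data
library(caret)
set.seed(314)  # seed requested in the instructions, so the partition is reproducible
in_train <- createDataPartition(cleaned_bf$brozek_C, p = 0.8, list = FALSE)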
training <- cleaned_bf[in_train, ]
testing <- cleaned_bf[-in_train, ]
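To see how the partition behaves without the course data, here is a minimal sketch on simulated values. The data frame `toy` and its columns are made up for illustration; only `createDataPartition()` and its `p` and `list` arguments come from caret.

```r
library(caret)

# Simulated stand-in for the cleaned data (hypothetical columns y and x)
set.seed(314)
toy <- data.frame(y = rnorm(100), x = runif(100))

# Row indices for the training set, sampled within quantile groups of y
idx <- createDataPartition(toy$y, p = 0.8, list = FALSE)
train_toy <- toy[idx, ]
test_toy <- toy[-idx, ]

# Every row lands in exactly one of the two sets
dim(train_toy)
dim(test_toy)
```

Because the sampling is done within groups of the response, the training share can come out slightly above 80%, which is consistent with the 203 and 48 rows reported below for 251 observations.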
Use the dim() function to verify the sizes of training and testing data sets.
# Type your code and comments inside the code chunk
# Verifying dimensions of training and testing
dim(training)
dim(testing)
[1] 203 18
[1] 48 18
There are 203 observations and 18 variables in the training data set. There are 48 observations and 18 variables in the testing data set.
Transform the data.
Use the preProcess() function to transform the predictors that are in the training data set. Specifically, pass a vector with "center", "scale", and "BoxCox" to the method argument of preProcess(). Make sure not to transform the response (brozek_C).
# Type your code and comments inside the code chunk
# Transforming the data
training_pp <- preProcess(training, method = c("center", "scale", "BoxCox"))
Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.
# Type your code and comments inside the code chunk
# Creating trainingTrans and testingTrans
trainingTrans <- predict(training_pp, training)
testingTrans <- predict(training_pp, testing)
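The instructions ask that the response not be transformed. One way to arrange that, sketched here on simulated data with hypothetical names (`toy_train`, `toy_test`, `predictor_names`), is to fit `preProcess()` on the predictor columns only and re-attach the untouched response afterwards. This is an illustration under those assumptions, not the code that produced the results reported in this document.

```r
library(caret)

# Simulated data: y plays the role of the response, x1 and x2 are positive predictors
set.seed(1)
toy_train <- data.frame(y = rnorm(50, 20, 8),
                        x1 = runif(50, 60, 90),
                        x2 = runif(50, 15, 20))
toy_test <- data.frame(y = rnorm(20, 20, 8),
                       x1 = runif(20, 60, 90),
                       x2 = runif(20, 15, 20))

# Learn the transformation from the training predictors only
predictor_names <- setdiff(names(toy_train), "y")
pp <- preProcess(toy_train[, predictor_names],
                 method = c("center", "scale", "BoxCox"))

# Apply it to both sets and put the original response back in front
train_trans <- cbind(y = toy_train$y, predict(pp, toy_train[, predictor_names]))
test_trans <- cbind(y = toy_test$y, predict(pp, toy_test[, predictor_names]))

summary(train_trans)
```

With this arrangement the response stays in its original units, so RMSE values computed later are in percent body fat rather than in standard deviations.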
Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.
# Type your code and comments inside the code chunk
# Define the type of resampling
myControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
Fit a linear regression model using forward stepwise selection.
Use the corrplot() function from the corrplot package written by Wei and Simko (2017) to identify predictors that may be linearly related in trainingTrans. Are any of the variables collinear? If so, remove the predictor that is least correlated with the response variable. Note that when method = "number" is used with corrplot(), color-coded numerical correlations are displayed.
# Type your code and comments inside the code chunk
# Identifying linearly related predictors
library(corrplot)
cor_mat <- cor(trainingTrans)  # named cor_mat to avoid masking the cor() function
corrplot(cor_mat, method = "number")
cm <- cor(x = trainingTrans$abdomen_cm, y=trainingTrans$brozek_C)
wrist <- cor(x = trainingTrans$abdomen_wrist, y=trainingTrans$brozek_C)
The variables age and age_sq are collinear, as are abdomen_wrist and abdomen_cm.
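Reading highly correlated pairs off a correlation plot can also be done programmatically. The base R sketch below uses simulated data and an arbitrary cutoff of 0.9; both the data frame `toy` and the cutoff are assumptions made for illustration.

```r
# Simulated predictors: age_sq is a function of age, noise is unrelated
set.seed(2)
age <- sample(22:81, 60, replace = TRUE)
toy <- data.frame(age = age, age_sq = age^2, noise = rnorm(60))

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the assumed cutoff
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
pairs_found <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
pairs_found
```

For each flagged pair, the predictor with the weaker correlation to the response would then be the one to drop, as the instructions describe.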
Use the train() function with method = "leapForward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a forward selection model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_FS. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with forward stepwise selection
set.seed(42)
mod_FS <- train(brozek_C ~ . -age -abdomen_cm,
                data = trainingTrans,
                method = "leapForward",
                tuneLength = 10,
                trControl = myControl)
Print mod_FS to the R console.
# Type your code and comments inside the code chunk
# Printing mod_FS
Linear Regression with Forward Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 0.5506026 0.7033709 0.4578342
3 0.5383992 0.7165354 0.4511958
4 0.5334791 0.7221570 0.4462905
5 0.5356176 0.7216630 0.4476914
6 0.5343806 0.7226373 0.4450996
7 0.5335540 0.7249148 0.4429665
8 0.5278222 0.7309489 0.4384376
9 0.5287631 0.7292929 0.4399173
10 0.5294384 0.7279965 0.4399193
11 0.5280666 0.7291664 0.4398026
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 8.
Using the output in your console, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Isolating results from mod_FS
7 8
RMSE was the criterion used to pick the best submodel. The best model has 8 predictors and an RMSE of 0.5278222.
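The selection rule ("smallest RMSE") can be reproduced from the printed table alone. In the sketch below, `fs_results` is a hypothetical data frame holding the nvmax and RMSE columns copied from the output above; on a fitted caret object the same information is available in `mod_FS$results` and `mod_FS$bestTune`.

```r
# nvmax and RMSE columns copied from the printed resampling results
fs_results <- data.frame(
  nvmax = 2:11,
  RMSE = c(0.5506026, 0.5383992, 0.5334791, 0.5356176, 0.5343806,
           0.5335540, 0.5278222, 0.5287631, 0.5294384, 0.5280666)
)

# Row with the smallest cross-validated RMSE
best_fs <- fs_results[which.min(fs_results$RMSE), ]
best_fs
```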
Use the summary() function to find out which predictors are selected as the final submodel.
# Type your code and comments inside the code chunk
# Viewing final model
Subset selection object
15 Variables (and intercept)
Forced in Forced out
weight_lbs FALSE FALSE
height_in FALSE FALSE
chest_cm FALSE FALSE
thigh_cm FALSE FALSE
ankle_cm FALSE FALSE
biceps_cm FALSE FALSE
forearm_cm FALSE FALSE
wrist_cm FALSE FALSE
abdomen_wrist FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: forward
weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " " " " "
3 ( 1 ) "*" " " " " " " " " " " " "
4 ( 1 ) "*" " " " " " " " " " " " "
5 ( 1 ) "*" " " " " " " " " " " " "
6 ( 1 ) "*" " " "*" " " " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist
1 ( 1 ) " " " " " " " " " " " " "*"
2 ( 1 ) " " " " " " " " " " " " "*"
3 ( 1 ) " " " " " " "*" " " " " "*"
4 ( 1 ) " " " " " " "*" " " "*" "*"
5 ( 1 ) " " " " " " "*" "*" "*" "*"
6 ( 1 ) " " " " " " "*" "*" "*" "*"
7 ( 1 ) " " " " " " "*" "*" "*" "*"
8 ( 1 ) " " " " " " "*" "*" "*" "*"
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) "*"
The final submodel (the row for size 8 in the output above) uses the predictors weight_lbs, neck_cm, chest_cm, wrist_cm, bmi_C, age_sq, abdomen_wrist, and am.
Compute the RMSE for mod_FS using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_FS <- RMSE(predict(mod_FS, testingTrans), testingTrans$brozek_C)
[1] 0.6117522
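For reference, the quantity reported here is the square root of the mean squared difference between predictions and observed values. The base R sketch below spells that out with a hypothetical helper `rmse()` and made-up numbers; it is meant to show the formula, not to replace caret's `RMSE()`.

```r
# Hypothetical helper: root mean squared error of predictions against observations
rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}

# Made-up values: two exact predictions and one that is off by 2
pred <- c(1, 2, 3)
obs <- c(1, 2, 5)
rmse_example <- rmse(pred, obs)
rmse_example
```

Here the squared errors are 0, 0, and 4, so the result is the square root of 4/3.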
Fit a linear regression model using backward stepwise selection.
Use the train() function with method = "leapBackward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a backward elimination model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_BE. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with backward stepwise selection
set.seed(42)
mod_BE <- train(brozek_C ~ . -age -abdomen_cm,
                data = trainingTrans,
                method = "leapBackward",
                tuneLength = 10,
                trControl = myControl)
Print mod_BE to the R console.
# Type your code and comments inside the code chunk
# Printing mod_BE
Linear Regression with Backwards Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 0.5331715 0.7223546 0.4437283
3 0.5262643 0.7294633 0.4357332
4 0.5330154 0.7238807 0.4422245
5 0.5304839 0.7267507 0.4397319
6 0.5305875 0.7264251 0.4403598
7 0.5307921 0.7270677 0.4404149
8 0.5246361 0.7332446 0.4357663
9 0.5286987 0.7294381 0.4404245
10 0.5283786 0.7299483 0.4397340
11 0.5279656 0.7304313 0.4402907
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 8.
According to the output, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Viewing results from mod_BE
7 8
RMSE is the criterion used to pick the best model. The selected model has 8 predictors and an RMSE of 0.5246361.
Use the summary() function to find out which predictors are selected as the final submodel.
# Type your code and comments inside the code chunk
# Viewing final model
Subset selection object
15 Variables (and intercept)
Forced in Forced out
weight_lbs FALSE FALSE
height_in FALSE FALSE
chest_cm FALSE FALSE
thigh_cm FALSE FALSE
ankle_cm FALSE FALSE
biceps_cm FALSE FALSE
forearm_cm FALSE FALSE
wrist_cm FALSE FALSE
abdomen_wrist FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: backward
weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " "
5 ( 1 ) " " " " "*" " " " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist
1 ( 1 ) " " " " " " " " " " " " "*"
2 ( 1 ) " " " " " " "*" " " " " "*"
3 ( 1 ) " " " " " " "*" " " "*" "*"
4 ( 1 ) " " " " " " "*" "*" "*" "*"
5 ( 1 ) " " " " " " "*" "*" "*" "*"
6 ( 1 ) " " " " " " "*" "*" "*" "*"
7 ( 1 ) " " " " " " "*" "*" "*" "*"
8 ( 1 ) " " " " " " "*" "*" "*" "*"
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) "*"
8 ( 1 ) "*"
The final submodel (the row for size 8 in the output above) uses the predictors weight_lbs, neck_cm, chest_cm, wrist_cm, bmi_C, age_sq, abdomen_wrist, and am.
Compute the RMSE for mod_BE using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_BE <- RMSE(predict(mod_BE, testingTrans), testingTrans$brozek_C)
[1] 0.6117522
Fit a constrained linear regression model.
Use the train function with method = "glmnet" and tuneLength = 10 to fit a constrained linear regression model named mod_EN. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit constrained model (elastic net)
mod_EN <- train(brozek_C ~ . -age_sq -abdomen_cm,
data = trainingTrans,
method = "glmnet",
tuneLength = 10,
trControl = myControl)
Print mod_EN to the R console.
# Type your code and comments inside the code chunk
# Printing mod_EN
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.1 0.0003820240 0.5231616 0.7334311 0.4329046
0.1 0.0008825249 0.5231413 0.7334561 0.4328746
0.1 0.0020387471 0.5228081 0.7338136 0.4323957
0.1 0.0047097703 0.5223652 0.7341611 0.4317239
0.1 0.0108801806 0.5219756 0.7344251 0.4309068
0.1 0.0251346291 0.5234142 0.7329996 0.4317356
0.1 0.0580642545 0.5272526 0.7298459 0.4365018
0.1 0.1341359623 0.5378889 0.7224132 0.4476435
0.1 0.3098714780 0.5703232 0.6959115 0.4785758
0.1 0.7158433225 0.6267836 0.6504147 0.5257102
0.2 0.0003820240 0.5236133 0.7330705 0.4333380
0.2 0.0008825249 0.5230928 0.7335895 0.4327565
0.2 0.0020387471 0.5225364 0.7340826 0.4321162
0.2 0.0047097703 0.5220124 0.7344967 0.4314014
0.2 0.0108801806 0.5216392 0.7347328 0.4305806
0.2 0.0251346291 0.5233570 0.7330851 0.4323391
0.2 0.0580642545 0.5257075 0.7320202 0.4366554
0.2 0.1341359623 0.5400637 0.7225144 0.4520196
0.2 0.3098714780 0.5825931 0.6894484 0.4901949
0.2 0.7158433225 0.6468834 0.6542099 0.5420342
0.3 0.0003820240 0.5234028 0.7332712 0.4330782
0.3 0.0008825249 0.5229147 0.7337588 0.4325531
0.3 0.0020387471 0.5223650 0.7342351 0.4319473
0.3 0.0047097703 0.5217919 0.7346887 0.4311637
0.3 0.0108801806 0.5215840 0.7347496 0.4304884
0.3 0.0251346291 0.5230932 0.7334299 0.4328241
0.3 0.0580642545 0.5252148 0.7332064 0.4373919
0.3 0.1341359623 0.5446221 0.7196972 0.4572548
0.3 0.3098714780 0.5920870 0.6850715 0.4998493
0.3 0.7158433225 0.6704415 0.6602620 0.5592390
0.4 0.0003820240 0.5227559 0.7339039 0.4323624
0.4 0.0008825249 0.5225357 0.7341307 0.4321732
0.4 0.0020387471 0.5221812 0.7344077 0.4317626
0.4 0.0047097703 0.5216434 0.7348231 0.4309913
0.4 0.0108801806 0.5216175 0.7347027 0.4306022
0.4 0.0251346291 0.5226270 0.7339834 0.4330053
0.4 0.0580642545 0.5255304 0.7334863 0.4386092
0.4 0.1341359623 0.5486650 0.7172946 0.4613964
0.4 0.3098714780 0.5980475 0.6871362 0.5056990
0.4 0.7158433225 0.6931781 0.6751992 0.5805263
0.5 0.0003820240 0.5225069 0.7341692 0.4320918
0.5 0.0008825249 0.5223854 0.7342729 0.4320245
0.5 0.0020387471 0.5220775 0.7344933 0.4316556
0.5 0.0047097703 0.5215634 0.7348921 0.4308651
0.5 0.0108801806 0.5216570 0.7346623 0.4308962
0.5 0.0251346291 0.5221901 0.7344491 0.4332407
0.5 0.0580642545 0.5265059 0.7328969 0.4404169
0.5 0.1341359623 0.5550801 0.7117445 0.4676511
0.5 0.3098714780 0.6046765 0.6891083 0.5112450
0.5 0.7158433225 0.7216325 0.6842853 0.6065182
0.6 0.0003820240 0.5228011 0.7339215 0.4324977
0.6 0.0008825249 0.5224106 0.7342617 0.4320815
0.6 0.0020387471 0.5219702 0.7345905 0.4315486
0.6 0.0047097703 0.5214816 0.7349591 0.4307283
0.6 0.0108801806 0.5216594 0.7346453 0.4312456
0.6 0.0251346291 0.5218164 0.7348499 0.4336438
0.6 0.0580642545 0.5284605 0.7312139 0.4427067
0.6 0.1341359623 0.5630044 0.7037609 0.4744761
0.6 0.3098714780 0.6096292 0.6937591 0.5152154
0.6 0.7158433225 0.7546190 0.6918228 0.6344184
0.7 0.0003820240 0.5225943 0.7340715 0.4322544
0.7 0.0008825249 0.5223961 0.7342794 0.4320480
0.7 0.0020387471 0.5218804 0.7346772 0.4314297
0.7 0.0047097703 0.5214031 0.7350001 0.4305883
0.7 0.0108801806 0.5217405 0.7345737 0.4316623
0.7 0.0251346291 0.5215655 0.7351777 0.4339676
0.7 0.0580642545 0.5313261 0.7284626 0.4457845
0.7 0.1341359623 0.5681455 0.6986935 0.4790050
0.7 0.3098714780 0.6162603 0.6955693 0.5204536
0.7 0.7158433225 0.7881084 0.6953102 0.6610099
0.8 0.0003820240 0.5226445 0.7340408 0.4322954
0.8 0.0008825249 0.5223703 0.7343052 0.4320652
0.8 0.0020387471 0.5218030 0.7347453 0.4313461
0.8 0.0047097703 0.5213865 0.7350036 0.4305462
0.8 0.0108801806 0.5218877 0.7344386 0.4322037
0.8 0.0251346291 0.5216047 0.7352628 0.4344838
0.8 0.0580642545 0.5339401 0.7258578 0.4486395
0.8 0.1341359623 0.5708584 0.6968780 0.4813561
0.8 0.3098714780 0.6242583 0.6950890 0.5261636
0.8 0.7158433225 0.8211556 0.6957120 0.6881450
0.9 0.0003820240 0.5224549 0.7342159 0.4321450
0.9 0.0008825249 0.5223022 0.7343360 0.4319805
0.9 0.0020387471 0.5217390 0.7347943 0.4312610
0.9 0.0047097703 0.5213608 0.7350144 0.4305236
0.9 0.0108801806 0.5220283 0.7342824 0.4327218
0.9 0.0251346291 0.5218485 0.7351628 0.4351333
0.9 0.0580642545 0.5360819 0.7237029 0.4507902
0.9 0.1341359623 0.5731573 0.6955409 0.4829651
0.9 0.3098714780 0.6313123 0.6950041 0.5307282
0.9 0.7158433225 0.8599629 0.6957120 0.7207924
1.0 0.0003820240 0.5224778 0.7342351 0.4321632
1.0 0.0008825249 0.5222664 0.7343740 0.4319393
1.0 0.0020387471 0.5216845 0.7348274 0.4311726
1.0 0.0047097703 0.5212938 0.7350651 0.4304975
1.0 0.0108801806 0.5221174 0.7341735 0.4330826
1.0 0.0251346291 0.5224455 0.7346374 0.4360844
1.0 0.0580642545 0.5377594 0.7220514 0.4521799
1.0 0.1341359623 0.5747790 0.6947389 0.4838089
1.0 0.3098714780 0.6378602 0.6957120 0.5356386
1.0 0.7158433225 0.9060164 0.6957120 0.7594059
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.00470977.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion? Plot the object mod_EN.
# Type your code and comments inside the code chunk
# Viewing results from mod_EN and plotting mod_EN
RMSE was used to pick the best model. The selected tuning values are alpha = 1 and lambda = 0.00470977, with an RMSE of 0.5212938.
Compute the RMSE for mod_EN using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_EN <- RMSE(predict(mod_EN, testingTrans), testingTrans$brozek_C)
[1] 0.6246849
Fit a regression tree.
Use the train() function with method = "rpart", tuneLength = 10 along with the myControl as the trControl to fit a regression tree named mod_TR. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit Regression Tree
mod_TR <- train(brozek_C ~ . -age -abdomen_cm,
data = trainingTrans,
method = "rpart",
tuneLength = 10,
trControl = myControl)
Print mod_TR to the R console.
# Type your code and comments inside the code chunk
# Printing mod_TR
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.007703033 0.6636885 0.5851521 0.5433992
0.008433471 0.6625036 0.5849659 0.5434520
0.010116937 0.6542281 0.5939651 0.5376865
0.011548602 0.6543465 0.5930476 0.5386454
0.015615982 0.6604265 0.5864218 0.5462687
0.020312130 0.6588497 0.5808712 0.5451845
0.025576073 0.6477331 0.5946999 0.5348745
0.036213572 0.6592349 0.5789596 0.5380807
0.097234710 0.7291152 0.4830263 0.5844113
0.525027794 0.8741036 0.4114001 0.7219792
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.02557607.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_TR
RMSE was used to determine the best model. The selected complexity parameter (cp) is 0.02557607, with an RMSE of 0.6477331.
Use the rpart() function from the rpart package written by Therneau and Atkinson (2018) to build the regression tree using the complexity parameter (cp) value from mod_TR above. Name this tree mod_TR3.
# Type your code and comments inside the code chunk
# Building regression tree using rpart
mod_TR3 <- rpart(brozek_C ~ . -age -abdomen_cm,
data = trainingTrans,
control = rpart.control(cp = mod_TR$bestTune$cp, xval = 50))
Use the plot() function from the partykit package written by Hothorn and Zeileis (2019) to graph mod_TR3.
# Type your code and comments inside the code chunk
# Plotting mod_TR3 with partykit
Use the rpart.plot() function from the rpart.plot package written by Milborrow (2018) to graph mod_TR3.
# Type your code and comments inside the code chunk
# Plotting mod_TR3 with rpart.plot
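Both plotting routes can be sketched on a small tree fit to the built-in mtcars data. The object `toy_tree` and the cp value of 0.02 are assumptions chosen for illustration; with the course data, mod_TR3 would take the place of `toy_tree`.

```r
library(rpart)
library(partykit)
library(rpart.plot)

# Small regression tree on a built-in data set (stand-in for mod_TR3)
toy_tree <- rpart(mpg ~ wt + hp, data = mtcars,
                  control = rpart.control(cp = 0.02))

# partykit: convert the rpart object to a party object, then plot it
plot(as.party(toy_tree))

# rpart.plot: each node shows the predicted mean and the share of observations
rpart.plot(toy_tree)
```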
What predictors are used in the graph of mod_TR3?
The predictor used in the graph of mod_TR3 is the variable abdomen_wrist.
Explain the tree
The tree splits only on abdomen_wrist. Someone with an abdomen_wrist less than -0.3 and then less than -0.9 lands in a leaf that contains 20% of the training observations. On the other side, someone with an abdomen_wrist greater than -0.3, greater than 0.77, and greater than 1.8 lands in a leaf that contains 3% of the training observations. The percentage shown in each node is the share of training observations that reach it, not the chance that the prediction is correct; the predicted brozek_C for a person is the mean response in the leaf where they land.
According to the tree, the abdomen_wrist measurements can be negative. Is this possible? If so, explain the reason for the negative values.
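Yes, negative values are possible here. The variable abdomen_wrist is equal to abdomen_cm - wrist_cm, and the raw values are positive (for example, 68.1 for the first case). The negative values in the tree arise because the tree was fit to trainingTrans, where the predictors were centered and scaled; a negative value means the transformed measurement is below the training-set mean.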
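A short base R sketch shows how centering and scaling alone turn strictly positive measurements into values on both sides of zero. The vector holds the first eight abdomen_wrist values from the glimpse() output above; `scale()` stands in for the centering and scaling steps only and omits the Box-Cox step.

```r
# First eight abdomen_wrist values shown in the glimpse() output
abdomen_wrist <- c(68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7)

# Center to mean 0 and scale to standard deviation 1
z <- as.numeric(scale(abdomen_wrist))

# Raw values are all positive; standardized values below the mean are negative
data.frame(raw = abdomen_wrist, standardized = round(z, 2))
```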
Compute the RMSE for mod_TR3 using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_TR <- RMSE(predict(mod_TR3, testingTrans), testingTrans$brozek_C)
[1] 0.7093008
Fit a Random Forest model.
Use the train() function with method = "ranger", tuneLength = 10 along with the myControl as the trControl to fit a random forest model named mod_RF. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit Random Forest model
mod_RF <- train(brozek_C ~ . -age -abdomen_cm,
data = trainingTrans,
method = "ranger",
tuneLength = 10,
trControl = myControl)
Print mod_RF to the R console.
# Type your code and comments inside the code chunk
# Printing mod_RF
Random Forest
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 0.6033529 0.6553213 0.4990980
2 extratrees 0.6109769 0.6535320 0.5042730
3 variance 0.5921607 0.6657163 0.4922751
3 extratrees 0.5968306 0.6666102 0.4925443
4 variance 0.5845042 0.6728471 0.4863736
4 extratrees 0.5882696 0.6749139 0.4869029
6 variance 0.5774562 0.6787746 0.4826502
6 extratrees 0.5781233 0.6835980 0.4805376
7 variance 0.5759719 0.6800426 0.4812854
7 extratrees 0.5741989 0.6874072 0.4772021
9 variance 0.5732636 0.6821994 0.4774872
9 extratrees 0.5693255 0.6917845 0.4752408
10 variance 0.5719643 0.6832542 0.4774101
10 extratrees 0.5683003 0.6925570 0.4733990
12 variance 0.5723502 0.6822282 0.4790659
12 extratrees 0.5641502 0.6957053 0.4717005
13 variance 0.5710778 0.6833039 0.4770955
13 extratrees 0.5632266 0.6970135 0.4702682
15 variance 0.5707742 0.6833333 0.4768475
15 extratrees 0.5596436 0.6994116 0.4676028
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 15, splitrule =
extratrees and min.node.size = 5.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_RF
mtry splitrule min.node.size
20 15 extratrees 5
RMSE was used to pick the best submodel. With mtry = 15, splitrule = extratrees, and min.node.size = 5, the best model has a cross-validated RMSE of 0.5596436.
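The selection rule stated in the output ("the smallest value" of RMSE) can be sketched in base R. The data frame below copies four rows from the printed resampling table; the object names `results` and `best` are illustrative. With a fitted `train` object, the full table and the chosen row should be available as `mod_RF$results` and `mod_RF$bestTune`.

```r
# A few rows copied from the printed mod_RF resampling table.
results <- data.frame(
  mtry      = c(2, 2, 15, 15),
  splitrule = c("variance", "extratrees", "variance", "extratrees"),
  RMSE      = c(0.6033529, 0.6109769, 0.5707742, 0.5596436)
)

# Keep the row with the smallest cross-validated RMSE.
best <- results[which.min(results$RMSE), ]
best
```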
Use the function RMSE
along with the predict
function to find the root mean square error for the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_RF <- RMSE(predict(mod_RF, testingTrans), testingTrans$brozek_C)
[1] 0.5873907
Among the models created from Problem 6 - Problem 10 (mod_FS
, mod_BE
, mod_EN
, mod_TR
, and mod_RF
), which do you think is best for predicting body fat and why?
# Type your code and comments inside the code chunk
# Creating resamples list named mods
[1] 0.6117522 0.6117522 0.6246849 0.7093008 0.5873907
Among the models created in Problem 6 through Problem 10, the best model for predicting body fat appears to be the random forest model, mod_RF, because its testing-set RMSE (0.5873907) is the lowest of the five models.
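The comparison can be made explicit by naming the five testing-set RMSE values and sorting them. This sketch assumes the printed vector is in the order mod_FS, mod_BE, mod_EN, mod_TR, mod_RF, as listed in the question; the last two values do match the RMSE_TR and RMSE_RF results reported earlier.

```r
# Testing-set RMSE values from the printed vector, labeled in the
# order the models are listed in the question (an assumption).
test_rmse <- c(mod_FS = 0.6117522, mod_BE = 0.6117522, mod_EN = 0.6246849,
               mod_TR = 0.7093008, mod_RF = 0.5873907)

# Smallest RMSE first.
sort(test_rmse)

best_model <- names(which.min(test_rmse))
best_model
```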
Many statistical algorithms work better on transformed variables; however, the user, whether a nurse, physical therapist, or physician, should be able to use your proposed model without resorting to a spreadsheet or calculator. Consequently, no transformation will take place in the models you will fit in this question. Repeat Problem 6 through Problem 10 using the untransformed data in training
and testing
you created in Problem 3. Make sure to give new names to your new models that use the untransformed data.
Use the corrplot()
function from the corrplot
package written by Wei and Simko (2017) to identify predictors that may be linearly related in training
# Type your code and comments inside the code chunk
# Identifying linearly related predictors Problem 6
cor_mat <- cor(training)  # avoid reusing the name of the cor() function
corrplot(cor_mat, method = "number")
cm <- cor(x = training$abdomen_cm, y=training$brozek_C)
wrist <- cor(x = training$abdomen_wrist, y=training$brozek_C)
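Reading a large correlation plot by eye is error-prone, so it can help to list the near-perfectly correlated pairs directly. The sketch below uses a small made-up data frame (`toy`, with invented values) in which `height_cm` is an exact rescaling of `height_in`; the 0.999 cutoff is an arbitrary choice for illustration.

```r
# Toy data: height_cm is an exact rescaling of height_in, so the two
# columns are perfectly correlated. All values are made up.
toy <- data.frame(
  height_in  = c(65, 68, 70, 72, 74, 66),
  weight_lbs = c(150, 172, 165, 190, 210, 148),
  neck_cm    = c(36.1, 38.4, 37.0, 39.2, 40.5, 35.8)
)
toy$height_cm <- toy$height_in * 2.54

cor_mat <- cor(toy)

# Look only at the upper triangle so each pair is listed once.
high <- which(abs(cor_mat) > 0.999 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
high_pairs
```

Applied to `cor(training)`, the same steps would list the predictor pairs that the problem asks to exclude.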
Use the train()
function with method = "leapForward"
, tuneLength = 10
and assign the object myControl
to the trControl
argument of the train()
function to fit a forward selection model where the goal is to predict body fat. Use brozek_C
as the response and store the results of train()
in mod_FS2
. Use set.seed(42)
for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with forward stepwise selection
set.seed(42)  # seed requested in the problem statement
mod_FS2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "leapForward",
tuneLength = 10,
trControl = myControl)
Print mod_FS2
to the R console.
# Type your code and comments inside the code chunk
# Printing mod_FS2
Linear Regression with Forward Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 3.975510 0.7172364 3.328440
3 4.013375 0.7136037 3.350542
4 4.013338 0.7128407 3.356147
5 4.048652 0.7087094 3.369878
6 4.049642 0.7092520 3.365529
7 4.017277 0.7142455 3.346466
8 4.009138 0.7156374 3.330654
9 3.976430 0.7204939 3.292605
10 3.966893 0.7225691 3.290474
11 3.965347 0.7223120 3.284813
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 11.
Using the output in your console, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Isolating results from mod_FS2
10 11
RMSE was used to pick the best submodel. With nvmax = 11, the best model has a cross-validated RMSE of 3.965347, so the selected model uses up to 11 predictor variables.
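The RMSE column in the table is an average over resamples: "10 fold, repeated 5 times" yields 50 held-out RMSE values per tuning value. The base R sketch below shows the idea with `lm()` on the built-in `mtcars` data as a stand-in for the body fat data. It assumes `myControl` was built with something like `trainControl(method = "repeatedcv", number = 10, repeats = 5)`, which the printed header suggests but this section does not show.

```r
set.seed(42)

k <- 10        # folds per repeat
repeats <- 5   # number of repeats
n <- nrow(mtcars)

cv_rmse <- numeric(0)
for (r in seq_len(repeats)) {
  # Randomly assign each row to one of k folds.
  fold <- sample(rep(seq_len(k), length.out = n))
  for (i in seq_len(k)) {
    train_rows <- fold != i
    fit  <- lm(mpg ~ wt + hp, data = mtcars[train_rows, ])
    pred <- predict(fit, newdata = mtcars[!train_rows, ])
    obs  <- mtcars$mpg[!train_rows]
    cv_rmse <- c(cv_rmse, sqrt(mean((pred - obs)^2)))
  }
}

length(cv_rmse)  # one RMSE per fold per repeat
mean(cv_rmse)    # the kind of average reported in the RMSE column
```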
Use the train()
function with method = "leapBackward"
, tuneLength = 10
and assign the object myControl
to the trControl
argument of the train()
function to fit a backward elimination model where the goal is to predict body fat. Use brozek_C
as the response and store the results of train()
in mod_BE2
. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with backwards stepwise selection Problem 7
# with untransformed data
set.seed(42)  # seed requested in the problem statement
mod_BE2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "leapBackward",
tuneLength = 10,
trControl = myControl)
Print mod_BE2
to the R console.
# Type your code and comments inside the code chunk
# Printing mod_BE2
Linear Regression with Backwards Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 4.002785 0.7160759 3.322765
3 3.940370 0.7245924 3.277038
4 4.024413 0.7137547 3.356219
5 3.975481 0.7210802 3.313074
6 3.997598 0.7180677 3.318327
7 4.024611 0.7153761 3.338106
8 4.017507 0.7168306 3.334540
9 4.008288 0.7181004 3.325982
10 3.996402 0.7191567 3.315898
11 3.985836 0.7195863 3.304527
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 3.
According to the output, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Viewing results from mod_BE
Linear Regression with Backwards Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 0.5331715 0.7223546 0.4437283
3 0.5262643 0.7294633 0.4357332
4 0.5330154 0.7238807 0.4422245
5 0.5304839 0.7267507 0.4397319
6 0.5305875 0.7264251 0.4403598
7 0.5307921 0.7270677 0.4404149
8 0.5246361 0.7332446 0.4357663
9 0.5286987 0.7294381 0.4404245
10 0.5283786 0.7299483 0.4397340
11 0.5279656 0.7304313 0.4402907
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 8.
# Viewing final model
2 3
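RMSE was used to pick the best submodel. For mod_BE2, the final value is nvmax = 3 with a cross-validated RMSE of 3.940370, so up to 3 predictor variables are selected. The table printed just above is labeled as coming from mod_BE, the transformed-data model, whose best value was nvmax = 8 with an RMSE of 0.5246361.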
Compute the RMSE for mod_BE2
using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_BE2 <- RMSE(predict(mod_BE2, testing), testing$brozek_C)
[1] 5.045508
Use the train function with method = "glmnet"
and tuneLength = 10
to fit a constrained linear regression model named mod_EN2
. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit constrained model Problem 8
# with untransformed data
set.seed(42)  # seed requested in the problem statement
mod_EN2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "glmnet",
tuneLength = 10,
trControl = myControl)
Print mod_EN2
to the R console.
# Type your code and comments inside the code chunk
# Printing mod_EN2
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.1 0.002826336 3.898212 0.7306816 3.245262
0.1 0.006529204 3.897933 0.7307220 3.245058
0.1 0.015083308 3.895545 0.7311403 3.241223
0.1 0.034844398 3.891779 0.7317991 3.234050
0.1 0.080495081 3.884222 0.7329876 3.221861
0.1 0.185954082 3.889032 0.7324875 3.217446
0.1 0.429578060 3.933562 0.7273019 3.273984
0.1 0.992381059 4.021400 0.7185261 3.361966
0.1 2.292529014 4.258896 0.6919319 3.578695
0.1 5.296039497 4.668823 0.6471909 3.903705
0.2 0.002826336 3.903327 0.7301773 3.248850
0.2 0.006529204 3.902239 0.7303746 3.247602
0.2 0.015083308 3.898259 0.7309218 3.242272
0.2 0.034844398 3.890639 0.7320320 3.232161
0.2 0.080495081 3.881648 0.7333300 3.218666
0.2 0.185954082 3.891880 0.7321305 3.224696
0.2 0.429578060 3.928085 0.7285089 3.282415
0.2 0.992381059 4.040690 0.7178424 3.393632
0.2 2.292529014 4.349818 0.6849613 3.654502
0.2 5.296039497 4.813257 0.6520217 4.024312
0.3 0.002826336 3.906153 0.7298036 3.249005
0.3 0.006529204 3.904771 0.7300055 3.247757
0.3 0.015083308 3.898637 0.7308179 3.241826
0.3 0.034844398 3.887696 0.7324621 3.229256
0.3 0.080495081 3.879801 0.7335202 3.216224
0.3 0.185954082 3.896331 0.7315477 3.235838
0.3 0.429578060 3.928801 0.7289281 3.293221
0.3 0.992381059 4.066518 0.7159738 3.422237
0.3 2.292529014 4.400627 0.6838901 3.704633
0.3 5.296039497 4.983582 0.6586167 4.171876
0.4 0.002826336 3.903819 0.7301210 3.248358
0.4 0.006529204 3.903426 0.7301710 3.247816
0.4 0.015083308 3.898168 0.7309531 3.241826
0.4 0.034844398 3.885125 0.7328299 3.226894
0.4 0.080495081 3.879460 0.7335013 3.215697
0.4 0.185954082 3.899541 0.7311169 3.245998
0.4 0.429578060 3.934357 0.7286345 3.306616
0.4 0.992381059 4.092884 0.7138247 3.444701
0.4 2.292529014 4.444713 0.6859326 3.747502
0.4 5.296039497 5.149917 0.6736927 4.319777
0.5 0.002826336 3.903505 0.7302286 3.247733
0.5 0.006529204 3.902470 0.7303571 3.246765
0.5 0.015083308 3.897127 0.7311317 3.240569
0.5 0.034844398 3.883187 0.7330516 3.224895
0.5 0.080495081 3.880840 0.7332664 3.217132
0.5 0.185954082 3.903901 0.7305478 3.254614
0.5 0.429578060 3.945139 0.7274254 3.321369
0.5 0.992381059 4.139961 0.7080266 3.484672
0.5 2.292529014 4.492848 0.6880800 3.788181
0.5 5.296039497 5.359581 0.6829231 4.507267
0.6 0.002826336 3.903276 0.7302020 3.248133
0.6 0.006529204 3.900919 0.7304966 3.246190
0.6 0.015083308 3.896081 0.7312840 3.239674
0.6 0.034844398 3.882047 0.7331546 3.223683
0.6 0.080495081 3.883016 0.7329796 3.219828
0.6 0.185954082 3.907795 0.7300138 3.262730
0.6 0.429578060 3.956338 0.7260768 3.333083
0.6 0.992381059 4.193919 0.7005900 3.530124
0.6 2.292529014 4.529196 0.6924889 3.821134
0.6 5.296039497 5.601769 0.6908634 4.710635
0.7 0.002826336 3.904127 0.7299778 3.249239
0.7 0.006529204 3.901454 0.7303555 3.247018
0.7 0.015083308 3.892907 0.7317257 3.236682
0.7 0.034844398 3.880630 0.7333266 3.222263
0.7 0.080495081 3.885852 0.7325881 3.224287
0.7 0.185954082 3.910763 0.7295979 3.269390
0.7 0.429578060 3.969058 0.7243910 3.344746
0.7 0.992381059 4.222578 0.6970383 3.553660
0.7 2.292529014 4.579203 0.6938678 3.860845
0.7 5.296039497 5.843028 0.6940269 4.903381
0.8 0.002826336 3.906831 0.7294851 3.250907
0.8 0.006529204 3.903079 0.7301270 3.248279
0.8 0.015083308 3.890867 0.7320087 3.234609
0.8 0.034844398 3.879567 0.7334441 3.220952
0.8 0.080495081 3.888609 0.7321969 3.228650
0.8 0.185954082 3.912692 0.7292777 3.275092
0.8 0.429578060 3.982618 0.7224672 3.356549
0.8 0.992381059 4.242224 0.6953071 3.568179
0.8 2.292529014 4.637208 0.6933527 3.903227
0.8 5.296039497 6.086977 0.6942750 5.103765
0.9 0.002826336 3.907243 0.7295920 3.251813
0.9 0.006529204 3.904005 0.7300972 3.248767
0.9 0.015083308 3.888892 0.7322782 3.232773
0.9 0.034844398 3.879209 0.7334749 3.220375
0.9 0.080495081 3.891159 0.7318504 3.232807
0.9 0.185954082 3.914933 0.7289410 3.280292
0.9 0.429578060 3.994135 0.7208662 3.364936
0.9 0.992381059 4.258793 0.6939615 3.578897
0.9 2.292529014 4.684735 0.6939008 3.936921
0.9 5.296039497 6.373696 0.6942750 5.345038
1.0 0.002826336 3.907552 0.7295530 3.252845
1.0 0.006529204 3.905273 0.7299595 3.249799
1.0 0.015083308 3.887443 0.7324695 3.231457
1.0 0.034844398 3.879537 0.7334220 3.220381
1.0 0.080495081 3.892855 0.7316197 3.236196
1.0 0.185954082 3.916939 0.7286731 3.284792
1.0 0.429578060 4.005379 0.7194001 3.373151
1.0 0.992381059 4.268706 0.6933990 3.583624
1.0 2.292529014 4.734330 0.6942750 3.979685
1.0 5.296039497 6.713941 0.6942750 5.629564
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.9 and lambda
= 0.0348444.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion? Plot the object mod_EN2
# Type your code and comments inside the code chunk
# Viewing results from mod_EN2
alpha lambda
84 0.9 0.0348444
RMSE was used to pick the best submodel. With alpha = 0.9 and lambda = 0.0348444, the best model has a cross-validated RMSE of 3.879209.
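For reference, the two tuning parameters in the table enter the elastic net objective as shown below. This is the form given in the glmnet documentation for a Gaussian response, not something stated in this assignment; \(N\) is the number of training rows, \(x_i\) the predictor vector for row \(i\), and \(\beta_0\), \(\beta\) the intercept and coefficients.

\[\min_{\beta_0,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]\]

Setting \(\alpha = 1\) gives the lasso penalty and \(\alpha = 0\) gives the ridge penalty, so the selected \(\alpha = 0.9\) is close to the lasso, and the small selected \(\lambda\) means only light shrinkage is applied.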
Compute the RMSE for mod_EN2
using the testing
data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_EN2 <- RMSE(predict(mod_EN2, testing), testing$brozek_C)
[1] 5.257586
Use the train()
function with method = "rpart"
, tuneLength = 10
along with the myControl
as the trControl
to fit a regression tree named mod_TR2
. Use set.seed(42)
for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit a regression tree Problem 9
# with untransformed data
set.seed(42)  # seed requested in the problem statement
mod_TR2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "rpart",
tuneLength = 10,
trControl = myControl)
Print mod_TR2
to the R console.
# Type your code and comments inside the code chunk
# Printing mod_TR2
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.007703033 4.916433 0.5847639 4.025387
0.008433471 4.907655 0.5845893 4.025778
0.010116937 4.846352 0.5935986 3.983069
0.011548602 4.847237 0.5928790 3.990194
0.015615982 4.892275 0.5862508 4.046665
0.020312130 4.881305 0.5806739 4.040009
0.025576073 4.798975 0.5945769 3.963676
0.036213572 4.883539 0.5790766 3.986136
0.097234710 5.401180 0.4831269 4.329337
0.525027794 6.475054 0.4114001 5.348170
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.02557607.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_TR2
7 0.02557607
RMSE was used to pick the best submodel. The best model has an RMSE value of 4.798975 with a complexity parameter equal to 0.025576073.
Use the rpart()
function from the rpart
package written by Therneau and Atkinson (2018) to build the regression tree using the complexity parameter (cp
) value from mod_TR2
above. Name this tree mod_TR4
# Type your code and comments inside the code chunk
# Building regression tree using rpart
mod_TR4 <- rpart(brozek_C ~ . -age -abdomen_cm,
data = training,
control = rpart.control(cp = mod_TR2$bestTune$cp, xval = 50))
Use the rpart.plot()
function from the rpart.plot
package written by Milborrow (2018) to graph mod_TR4
# Type your code and comments inside the code chunk
# Plotting mod_TR4 with rpart.plot
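Reading the tree, the node where abdomen_wrist is greater than 71 but less than 81 contains 37% of the training observations. With the default rpart.plot display for a regression tree, this percentage is the share of observations that reach the node, not the probability that the tree predicts the outcome correctly.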
Compute the RMSE for mod_TR4
using the testing
data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_TR4 <- RMSE(predict(mod_TR4, testing), testing$brozek_C)
[1] 5.254253
Use the train()
function with method = "ranger"
, tuneLength = 10
along with the myControl
as the trControl
to fit a random forest model named mod_RF2
. Use set.seed(42)
for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit a random forest model Problem 10
# with untransformed data
set.seed(42)  # seed requested in the problem statement
mod_RF2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "ranger",
tuneLength = 10,
trControl = myControl)
Print mod_RF2
to the R console.
# Type your code and comments inside the code chunk
# Printing mod_RF2
Random Forest
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 4.469335 0.6555022 3.693232
2 extratrees 4.527554 0.6534191 3.725801
3 variance 4.386340 0.6659372 3.644807
3 extratrees 4.428709 0.6652206 3.657601
4 variance 4.331055 0.6726188 3.603693
4 extratrees 4.360713 0.6738456 3.610469
6 variance 4.278248 0.6788437 3.577347
6 extratrees 4.271360 0.6848878 3.542221
7 variance 4.266490 0.6801720 3.567629
7 extratrees 4.255429 0.6870425 3.538343
9 variance 4.248479 0.6817219 3.538063
9 extratrees 4.211828 0.6922616 3.510987
10 variance 4.243011 0.6824179 3.541788
10 extratrees 4.206633 0.6931284 3.505200
12 variance 4.245644 0.6815463 3.553926
12 extratrees 4.176676 0.6964050 3.490920
13 variance 4.235464 0.6828409 3.540009
13 extratrees 4.160651 0.6978126 3.473362
15 variance 4.233937 0.6826609 3.538610
15 extratrees 4.146548 0.6997243 3.469731
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 15, splitrule =
extratrees and min.node.size = 5.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_RF2
mtry splitrule min.node.size
20 15 extratrees 5
The criterion used was RMSE with a value of 4.146548.
Use the function RMSE()
along with the predict()
function to find the root mean square error for the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_RF2 <- RMSE(predict(mod_RF2, testing), testing$brozek_C)
[1] 4.339785
Which model does the best job of predicting body fat?
# Type your code and comments inside the code chunk
# Creating resamples list of different models
#model_list <- list(item1 = mod_FS2, item2 = mod_BE2, item3 = mod_EN2, item4 = mod_TR2, item5 = mod_TR3, item6 = mod_TR4, item7 = mod_RF2)
#ans <- resamples(model_list)
model_list2 <- list(item1 = mod_FS2, item2 = mod_BE2, item3 = mod_EN2, item4 = mod_TR2, item7 = mod_RF2)
ans2 <- resamples(model_list2)
summary.resamples(object = ans2)
Models: item1, item2, item3, item4, item7
Number of resamples: 50
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
item1 2.194046 3.021355 3.218821 3.284813 3.675648 4.330898 0
item2 2.397646 3.037138 3.233365 3.277038 3.554873 4.475806 0
item3 2.311599 2.953437 3.211983 3.220375 3.503270 4.264038 0
item4 2.963476 3.495434 3.918600 3.963676 4.369686 5.332065 0
item7 2.719000 3.148048 3.316040 3.469731 3.843903 4.287787 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
item1 2.728359 3.666489 3.928735 3.965347 4.304002 5.284396 0
item2 2.825743 3.728220 3.933744 3.940370 4.278550 5.035836 0
item3 2.772695 3.643004 3.868425 3.879209 4.141040 4.978459 0
item4 3.460347 4.340325 4.726587 4.798975 5.187490 6.432799 0
item7 3.307274 3.738256 4.034858 4.146548 4.578877 5.323989 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
item1 0.5297948 0.6551675 0.7351440 0.7223120 0.7755229 0.9152867 0
item2 0.5602622 0.6796893 0.7265992 0.7245924 0.7725706 0.9016342 0
item3 0.5478564 0.6854955 0.7395248 0.7334749 0.7816228 0.9013204 0
item4 0.3180474 0.5361211 0.6107806 0.5945769 0.6736918 0.7866000 0
item7 0.5014774 0.6446490 0.6992479 0.6997243 0.7556276 0.8463160 0
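In the resampling summary, mod_EN2 (item3) has the lowest mean RMSE (3.879209), the lowest mean MAE (3.220375), and the highest mean R-squared (0.7334749) of the five models. On the testing data, however, mod_RF2 has the lowest RMSE (4.339785), compared with 5.045508 for mod_BE2, 5.257586 for mod_EN2, and 5.254253 for mod_TR4.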
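Cross-validated and testing-set RMSE can point to different models, so it may help to tabulate them side by side. The sketch below copies the values reported in this section; the data frame name `compare` and its column names are illustrative. Two caveats: the testing RMSE listed for mod_TR2 is the one computed from mod_TR4 (the `rpart()` refit using the same cp), and mod_FS2 is omitted because no testing RMSE is reported for it above.

```r
# Values copied from the output above; see the caveats in the lead-in.
compare <- data.frame(
  model     = c("mod_BE2", "mod_EN2", "mod_TR2", "mod_RF2"),
  cv_rmse   = c(3.940370, 3.879209, 4.798975, 4.146548),
  test_rmse = c(5.045508, 5.257586, 5.254253, 4.339785)
)

# Rank by cross-validated RMSE, smallest first.
compare[order(compare$cv_rmse), ]

best_cv   <- as.character(compare$model[which.min(compare$cv_rmse)])
best_test <- as.character(compare$model[which.min(compare$test_rmse)])
c(best_cv = best_cv, best_test = best_test)
```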
Which model is the most practical model for someone who needs to rapidly assess a patient’s body fat?
The tree model is probably the most practical choice for someone who needs to rapidly assess a patient’s body fat. A single regression tree is easy to read and to explain to someone unfamiliar with these models: the user follows a few yes/no splits to a predicted value, with no spreadsheet or calculator required. The random forest model, by contrast, took the longest to fit and cannot be read off a single diagram, so it is probably not the most practical option for rapid assessment.
Dowle, Matt, and Arun Srinivasan. 2019. Data.table: Extension of 'Data.frame'.
Hothorn, Torsten, and Achim Zeileis. 2019. Partykit: A Toolkit for Recursive Partytioning.
Johnson, Roger W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics Education 4 (1). doi:10.1080/10691898.1996.11910505.
Milborrow, Stephen. 2018. Rpart.plot: Plot 'Rpart' Models: An Enhanced Version of 'Plot.rpart'.
Therneau, Terry, and Beth Atkinson. 2018. Rpart: Recursive Partitioning and Regression Trees.
Wei, Taiyun, and Viliam Simko. 2017. Corrplot: Visualization of a Correlation Matrix.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation.