This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt
Follow all directions. Type complete sentences to answer all questions inside the answer tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R code to report numeric answers inside the answer tags (i.e., do not hard code your numeric answers).
The article by Johnson (1996) defines body fat determined with the Siri and Brozek methods as well as fat-free weight using equations (1), (2), and (3), respectively.
\[\begin{equation} \text{bodyfatSiri} = \frac{457}{\text{density}} - 414.2 \tag{1} \end{equation}\]
\[\begin{equation} \text{bodyfatBrozek} = \frac{495}{\text{density}} - 450 \tag{2} \end{equation}\]
\[\begin{equation} \text{FatFreeWeight} = \left(1 - \frac{\text{brozek}}{100}\right) \times \text{weight_lbs} \tag{3} \end{equation}\]
Body Mass Index (BMI) is defined as
\[\text{BMI} = \frac{\text{weight (kg)}}{[\text{height (m)}]^2}\]
Please use the following conversion factors with this project: 0.453592 kilograms per pound and 2.54 centimeters per inch.
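As a quick sanity check, equations (1)-(3) and the BMI definition can be written as small R helper functions. This is a minimal sketch; the function names are illustrative and not part of the assignment.

```r
# Minimal sketch of equations (1)-(3) and the BMI definition;
# the function names are illustrative, not part of the assignment
bodyfat_siri <- function(density) 457 / density - 414.2
bodyfat_brozek <- function(density) 495 / density - 450
fat_free_weight <- function(brozek, weight_lbs) (1 - brozek / 100) * weight_lbs
bmi <- function(weight_lbs, height_in) {
  kg <- weight_lbs * 0.453592   # pounds to kilograms
  m <- height_in * 2.54 / 100   # inches to meters
  kg / m^2
}
bodyfat_siri(1.0708)  # roughly 12.6 percent body fat
```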
Use the original data from http://jse.amstat.org/datasets/fat.dat.txt and evaluate the quality of the data. Specifically, start by using the fread() function from the data.table package written by Dowle and Srinivasan (2019) to read the data from http://jse.amstat.org/datasets/fat.dat.txt into an object named bodyfat. Pass the following vector of names to the col.names argument of fread(): c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm", "forearm_cm", "wrist_cm").
```r
# Type your code and comments inside the code chunk
# Obtaining the original data
library(data.table)
names <- c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm", "forearm_cm", "wrist_cm")
bodyfat <- fread("http://jse.amstat.org/datasets/fat.dat.txt", col.names = names)
```
Create plotly interactive scatterplots of brozek versus density with case mapped to color, weight_lbs versus height_in with case mapped to color, and ankle_cm versus weight_lbs with case mapped to color to help identify potential outliers. How many values do you think are potentially data entry errors? Explain your reasoning and show the code you used to identify the errors.
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of brozek versus density
library(plotly)
p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
color = case)) +
geom_point() +
theme_bw()
g <- ggplotly(p)
g
Figure 1: Plot of brozek versus density
bodyfat[c(48, 76, 96, 42, 182, 216),
c("density", "brozek", "siri", "height_in", "weight_lbs")]
density brozek siri height_in weight_lbs
1: 1.0665 6.4 5.6 71.25 148.50
2: 1.0666 18.3 18.5 67.50 148.25
3: 1.0991 17.3 17.4 77.75 224.50
4: 1.0250 31.7 32.9 29.50 205.00
5: 1.1089 0.0 0.0 68.00 118.50
6: 0.9950 45.1 47.5 64.00 219.00
bodyfat <- bodyfat %>% mutate(brozek_eq = (495 / density) - 450 )
bodyfat<- bodyfat %>% mutate(broDiff = brozek - brozek_eq)
bodyfat %>%
filter(abs(broDiff) >2)
case brozek siri density age weight_lbs height_in bmi fat_free_weight
1 48 6.4 5.6 1.0665 39 148.50 71.25 20.6 139.0
2 76 18.3 18.5 1.0666 61 148.25 67.50 22.9 121.1
3 96 17.3 17.4 1.0991 53 224.50 77.75 26.1 185.7
4 182 0.0 0.0 1.1089 40 118.50 68.00 18.1 118.5
5 216 45.1 47.5 0.9950 51 219.00 64.00 37.6 120.2
neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm ankle_cm biceps_cm
1 34.6 89.8 79.5 92.7 52.7 37.5 21.9 28.8
2 36.0 91.6 81.8 94.8 54.5 37.0 21.4 29.3
3 41.1 113.2 99.2 107.5 61.7 42.3 23.2 32.9
4 33.8 79.3 69.4 85.0 47.2 33.5 20.2 27.7
5 41.2 119.8 122.1 112.8 62.5 36.9 23.6 34.7
forearm_cm wrist_cm brozek_eq broDiff
1 26.8 17.9 14.1350211 -7.735021
2 27.0 18.3 14.0915057 4.208494
3 30.8 20.4 0.3684833 16.931517
4 24.6 16.5 -3.6116873 3.611687
5 29.1 18.4 47.4874372 -2.387437
The potential data entry errors are the cases that lie off the curve, since brozek is computed directly from density using equation (2), so all points should fall on a single smooth curve. These outliers include cases 48, 76, 96, and 182; case 216 is most likely a data entry error as well. We reached these conclusions by creating a new variable called brozek_eq, which computes what the brozek value should be from the Brozek equation, and then finding the cases whose reported brozek differs from brozek_eq by more than 2 in absolute value. This is shown in the code above.
plot_ly(data = bodyfat, x = ~brozek, y = ~density,
marker = list(size = 5,
color = ~case,
line = list(color = ~case,
width = 1)))
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of weight_lbs versus height_in
p <- ggplot(data = bodyfat, aes(x = weight_lbs, y = height_in,
color = case)) +
geom_point() +
theme_bw()
g <- ggplotly(p)
g
Figure 2: Plot of weight_lbs versus height_in
There are a few outliers in this graph; however, there appears to be only one data entry error, case 42. A height of 29.5 inches with a weight of 205 lbs is practically impossible. The rest of the data are plausible combinations of height and weight, and we expect this relationship to show some variability since one variable is not computed directly from the other.
plot_ly(data = bodyfat, x = ~weight_lbs, y = ~height_in,
marker = list(size = 5,
color = ~case,
line = list(color = ~case,
width = 1)))
# Type your code and comments inside the code chunk
# Isolating points of interest
# Points of interest for brozek vs. density graph
bodyfat[c(48, 76, 96, 42, 182, 216),
c("density", "brozek", "siri", "height_in", "weight_lbs" )]
density brozek siri height_in weight_lbs
48 1.0665 6.4 5.6 71.25 148.50
76 1.0666 18.3 18.5 67.50 148.25
96 1.0991 17.3 17.4 77.75 224.50
42 1.0250 31.7 32.9 29.50 205.00
182 1.1089 0.0 0.0 68.00 118.50
216 0.9950 45.1 47.5 64.00 219.00
# Points of interest for height_in vs. weight_lbs graph
bodyfat[c(42),
c("density", "brozek", "siri", "height_in", "weight_lbs" )]
density brozek siri height_in weight_lbs
42 1.025 31.7 32.9 29.5 205
# Type your code and comments inside the code chunk
# Replacing identified typos of density and height_in
p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
color = case)) +
geom_point() +
theme_bw()
g <- ggplotly(p)
g
# Updating computed bodyfat values and bmi measurements
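The replacement itself is not shown in the chunk above, so the sketch below illustrates one possible correction. It assumes the height for case 42 was meant to be 69.5 in rather than 29.5 in (a value consistent with the recorded weight and BMI, but an assumption nonetheless) and then recomputes the BMI for that case.

```r
# Assumed correction: case 42's height of 29.5 in is treated as a transposed 69.5 in
bodyfat$height_in[42] <- 69.5
# Recompute BMI for the corrected case (kg / m^2)
bodyfat$bmi[42] <- round((bodyfat$weight_lbs[42] * 0.453592) /
                           ((bodyfat$height_in[42] * 2.54 / 100)^2), 1)
```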
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of ankle_cm versus weight_lbs
p <- ggplot(data = bodyfat, aes(x = ankle_cm, y = weight_lbs,
color = case)) +
geom_point() +
theme_bw()
g <- ggplotly(p)
g
Figure 3: Interactive scatterplot of ankle_cm versus weight_lbs
It looks like cases 31 and 86 could be data entry errors. Case 39 (top) is likely not an entry error, but possibly just an outlier. Cases 31 and 86 show abnormally large ankle diameters with average weights. The data entries should probably be 23.9 (case 31) and 23.7 (case 86).
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of ankle_cm versus weight_lbs
plot_ly(data = bodyfat, x = ~ankle_cm, y = ~weight_lbs,
marker = list(size = 5,
color = ~case,
line = list(color = ~case,
width = 1)))
Figure 4: Interactive scatterplot of ankle_cm versus weight_lbs
# Type your code and comments inside the code chunk
# Replacing identified typos in ankle_cm
bodyfat$ankle_cm[31] <- 23.9
bodyfat$ankle_cm[86] <- 23.7
p <- ggplot(data = bodyfat, aes(x = ankle_cm, y = weight_lbs,
color = case)) +
geom_point() +
theme_bw()
g <- ggplotly(p)
g
# Type your code and comments inside the code chunk
# Identifying bodyfat typos for brozek and siri
p <- ggplot(data = bodyfat, aes(x = brozek, y = siri,
color = case)) +
geom_point() +
theme_bw()
g <- ggplotly(p)
g
# Type your code and comments inside the code chunk
# Number of rounding discrepancies for siri
bodyfat<- bodyfat %>%
mutate(siri_eq = round((457/density - 414.2),1))
sum(bodyfat$siri != bodyfat$siri_eq)
[1] 242
# Number of rounding discrepancies for brozek
sum(bodyfat$brozek != bodyfat$brozek_eq)
[1] 252
# Number of rounding discrepancies for bmi
height_m <- (bodyfat$height_in * 2.54) / 100
bodyfat<- bodyfat %>%
mutate(bmi_eq = round(((weight_lbs * 0.453592) / ((height_m) ^ 2)),1))
sum(bodyfat$bmi != bodyfat$bmi_eq)
[1] 99
Case 182 is a typo because a body fat of 0 is not physically possible. Case 169 is a possible typo because it does not fall on the line. Both of the possible typos mentioned above are most likely rounding errors; case 182 was most likely a very small value that was rounded down to 0.
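To see whether the two flagged cases are rounding artifacts rather than gross errors, the reported values can be compared with the density-implied values computed earlier (brozek_eq and siri_eq):

```r
# Reported versus density-implied body fat for the two flagged cases
bodyfat[c(169, 182), c("density", "siri", "siri_eq", "brozek", "brozek_eq")]
```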
Make the clean data accessible to R.
Load the file bodyfatClean.csv from https://github.com/alanarnholt/MISCD into your R session. Specifically, use the read.csv() function to load the file bodyfatClean.csv into your current R session, naming the object cleaned_bf. Since GitHub stores the file as html, click on the raw button to obtain a *.csv file.
# Type your code and comments inside the code chunk
# Read in clean data
library(dplyr)
cleaned_bf <- read.csv("https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv")
Use the glimpse() function from the dplyr package written by Wickham et al. (2019) to view the structure of cleaned_bf.
# Type your code and comments inside the code chunk
# Examining the object cleaned_bf
glimpse(cleaned_bf)
Observations: 251
Variables: 18
$ age <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32…
$ weight_lbs <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 18…
$ height_in <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 7…
$ neck_cm <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38…
$ chest_cm <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6,…
$ abdomen_cm <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 8…
$ hip_cm <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1…
$ thigh_cm <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62…
$ knee_cm <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38…
$ ankle_cm <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23…
$ biceps_cm <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35…
$ forearm_cm <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31…
$ wrist_cm <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18…
$ brozek_C <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1…
$ bmi_C <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24…
$ age_sq <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, …
$ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64…
$ am <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, …
Partition the data.
Use the createDataPartition() function from the caret package to partition the data into training and testing sets. Use 80% of the data for training and 20% for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable you want to use is brozek_C (the computed brozek based on the reported density).
# Type your code and comments inside the code chunk
# Partitioning the data
library(caret)
set.seed(314)
in_train <- createDataPartition(cleaned_bf$brozek_C, p = 0.8, list = FALSE)
training <- cleaned_bf[in_train, ]
testing <- cleaned_bf[-in_train, ]
Use the dim() function to verify the sizes of the training and testing data sets.
# Type your code and comments inside the code chunk
# Verifying dimensions of training and testing
dim(training)
[1] 203 18
dim(testing)
[1] 48 18
There are 203 observations and 18 variables in the training data set, and 48 observations and 18 variables in the testing data set.
Transform the data.
Use the preProcess() function to transform the predictors that are in the training data set. Specifically, pass a vector with "center", "scale", and "BoxCox" to the method argument of preProcess(). Make sure not to transform the response (brozek_C).
# Type your code and comments inside the code chunk
# Transforming the data
training_pp <- preProcess(training, method = c("center", "scale", "BoxCox"))
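Printing the preProcess object is a quick way to confirm what was estimated; the sketch below also lists which predictors were selected for a Box-Cox transformation (the exact set depends on the training split, and the `$method$BoxCox` element is an assumption about caret's stored structure):

```r
# Inspect the estimated pre-processing steps
print(training_pp)
training_pp$method$BoxCox   # predictors chosen for a Box-Cox transformation
```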
Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.
# Type your code and comments inside the code chunk
# Creating trainingTrans and testingTrans
trainingTrans <- predict(training_pp, training)
testingTrans <- predict(training_pp, testing)
Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.
```r
# Type your code and comments inside the code chunk
# Define the type of resampling
myControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
```
Fit a linear regression model using forward stepwise selection.
Use the corrplot() function from the corrplot package written by Wei and Simko (2017) to identify predictors that may be linearly related in trainingTrans. Are any of the variables collinear? If so, remove the predictor that is least correlated to the response variable. Note that when method = "number" is used with corrplot(), color-coded numerical correlations are displayed.
# Type your code and comments inside the code chunk
# Identifying linearly related predictors
library(corrplot)
cor <- cor(trainingTrans)
corrplot(cor, method = "number")
cm <- cor(x = trainingTrans$abdomen_cm, y=trainingTrans$brozek_C)
wrist <- cor(x = trainingTrans$abdomen_wrist, y=trainingTrans$brozek_C)
Age and age_sq are collinear, as are abdomen_wrist and abdomen_cm.
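One way to decide which member of each correlated pair to drop is to compare their correlations with the response; the member with the weaker correlation is the candidate for removal. A small sketch:

```r
# Correlation of each member of the two collinear pairs with the response
with(trainingTrans, c(age = cor(age, brozek_C),
                      age_sq = cor(age_sq, brozek_C),
                      abdomen_cm = cor(abdomen_cm, brozek_C),
                      abdomen_wrist = cor(abdomen_wrist, brozek_C)))
```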
Use the train() function with method = "leapForward", tuneLength = 10, and the object myControl assigned to the trControl argument to fit a forward selection model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_FS. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with forward stepwise selection
set.seed(42)
mod_FS <- train(brozek_C ~ . -age -abdomen_cm,
trainingTrans,
method = "leapForward",
tuneLength = 10,
trControl = myControl)
Print mod_FS to the R console.
# Type your code and comments inside the code chunk
# Printing mod_FS
print(mod_FS)
Linear Regression with Forward Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 0.5506026 0.7033709 0.4578342
3 0.5383992 0.7165354 0.4511958
4 0.5334791 0.7221570 0.4462905
5 0.5356176 0.7216630 0.4476914
6 0.5343806 0.7226373 0.4450996
7 0.5335540 0.7249148 0.4429665
8 0.5278222 0.7309489 0.4384376
9 0.5287631 0.7292929 0.4399173
10 0.5294384 0.7279965 0.4399193
11 0.5280666 0.7291664 0.4398026
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 8.
Using the output in your console, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Isolating results from mod_FS
mod_FS$bestTune
nvmax
7 8
RMSE was used to pick the best submodel; with nvmax = 8 predictors, the best model has an RMSE of 0.5278222.
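The same numbers can be pulled from the results table that train() stores rather than read off the printout (a sketch using caret's results data frame):

```r
# Resampled RMSE for the selected number of predictors
subset(mod_FS$results, nvmax == mod_FS$bestTune$nvmax)[, c("nvmax", "RMSE")]
```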
Use the summary() function to find out which predictors are selected as the final submodel.
# Type your code and comments inside the code chunk
# Viewing final model
summary(mod_FS)
Subset selection object
15 Variables (and intercept)
Forced in Forced out
weight_lbs FALSE FALSE
height_in FALSE FALSE
neck_cm FALSE FALSE
chest_cm FALSE FALSE
hip_cm FALSE FALSE
thigh_cm FALSE FALSE
knee_cm FALSE FALSE
ankle_cm FALSE FALSE
biceps_cm FALSE FALSE
forearm_cm FALSE FALSE
wrist_cm FALSE FALSE
bmi_C FALSE FALSE
age_sq FALSE FALSE
abdomen_wrist FALSE FALSE
am FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: forward
weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " " " " "
3 ( 1 ) "*" " " " " " " " " " " " "
4 ( 1 ) "*" " " " " " " " " " " " "
5 ( 1 ) "*" " " " " " " " " " " " "
6 ( 1 ) "*" " " "*" " " " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist
1 ( 1 ) " " " " " " " " " " " " "*"
2 ( 1 ) " " " " " " " " " " " " "*"
3 ( 1 ) " " " " " " "*" " " " " "*"
4 ( 1 ) " " " " " " "*" " " "*" "*"
5 ( 1 ) " " " " " " "*" "*" "*" "*"
6 ( 1 ) " " " " " " "*" "*" "*" "*"
7 ( 1 ) " " " " " " "*" "*" "*" "*"
8 ( 1 ) " " " " " " "*" "*" "*" "*"
am
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) "*"
The predictors selected in the final eight-variable submodel are weight_lbs, neck_cm, chest_cm, wrist_cm, bmi_C, age_sq, abdomen_wrist, and am.
Compute the RMSE for mod_FS using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_FS <- RMSE(predict(mod_FS, testingTrans), testingTrans$brozek_C)
RMSE_FS
[1] 0.6117522
Fit a linear regression model using backward stepwise selection.
Use the train() function with method = "leapBackward", tuneLength = 10, and the object myControl assigned to the trControl argument to fit a backward elimination model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_BE. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with backwards stepwise selection
set.seed(42)
mod_BE <- train(brozek_C ~ . -age -abdomen_cm,
trainingTrans,
method = "leapBackward",
tuneLength = 10,
trControl = myControl)
Print mod_BE to the R console.
# Type your code and comments inside the code chunk
# Printing mod_BE
print(mod_BE)
Linear Regression with Backwards Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 0.5331715 0.7223546 0.4437283
3 0.5262643 0.7294633 0.4357332
4 0.5330154 0.7238807 0.4422245
5 0.5304839 0.7267507 0.4397319
6 0.5305875 0.7264251 0.4403598
7 0.5307921 0.7270677 0.4404149
8 0.5246361 0.7332446 0.4357663
9 0.5286987 0.7294381 0.4404245
10 0.5283786 0.7299483 0.4397340
11 0.5279656 0.7304313 0.4402907
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 8.
According to the output, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Viewing results from mod_BE
mod_BE$bestTune
nvmax
7 8
The criterion used to pick the best submodel is RMSE; with nvmax = 8 predictors, the best model has an RMSE of 0.5246361.
Use the summary() function to find out which predictors are selected as the final submodel.
# Type your code and comments inside the code chunk
# Viewing final model
summary(mod_BE)
Subset selection object
15 Variables (and intercept)
Forced in Forced out
weight_lbs FALSE FALSE
height_in FALSE FALSE
neck_cm FALSE FALSE
chest_cm FALSE FALSE
hip_cm FALSE FALSE
thigh_cm FALSE FALSE
knee_cm FALSE FALSE
ankle_cm FALSE FALSE
biceps_cm FALSE FALSE
forearm_cm FALSE FALSE
wrist_cm FALSE FALSE
bmi_C FALSE FALSE
age_sq FALSE FALSE
abdomen_wrist FALSE FALSE
am FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: backward
weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " "
5 ( 1 ) " " " " "*" " " " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist
1 ( 1 ) " " " " " " " " " " " " "*"
2 ( 1 ) " " " " " " "*" " " " " "*"
3 ( 1 ) " " " " " " "*" " " "*" "*"
4 ( 1 ) " " " " " " "*" "*" "*" "*"
5 ( 1 ) " " " " " " "*" "*" "*" "*"
6 ( 1 ) " " " " " " "*" "*" "*" "*"
7 ( 1 ) " " " " " " "*" "*" "*" "*"
8 ( 1 ) " " " " " " "*" "*" "*" "*"
am
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) "*"
8 ( 1 ) "*"
The predictors selected in the final eight-variable submodel are weight_lbs, neck_cm, chest_cm, wrist_cm, bmi_C, age_sq, abdomen_wrist, and am.
Compute the RMSE for mod_BE using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_BE <- RMSE(predict(mod_BE, testingTrans), testingTrans$brozek_C)
RMSE_BE
[1] 0.6117522
Fit a constrained linear regression model.
Use the train() function with method = "glmnet" and tuneLength = 10 to fit a constrained linear regression model named mod_EN. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit constrained model (elastic net)
set.seed(42)
mod_EN <- train(brozek_C ~ . -age_sq -abdomen_cm,
data = trainingTrans,
method = "glmnet",
tuneLength = 10,
trControl = myControl)
Print mod_EN to the R console.
# Type your code and comments inside the code chunk
# Printing mod_EN
print(mod_EN)
glmnet
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.1 0.0003820240 0.5231616 0.7334311 0.4329046
0.1 0.0008825249 0.5231413 0.7334561 0.4328746
0.1 0.0020387471 0.5228081 0.7338136 0.4323957
0.1 0.0047097703 0.5223652 0.7341611 0.4317239
0.1 0.0108801806 0.5219756 0.7344251 0.4309068
0.1 0.0251346291 0.5234142 0.7329996 0.4317356
0.1 0.0580642545 0.5272526 0.7298459 0.4365018
0.1 0.1341359623 0.5378889 0.7224132 0.4476435
0.1 0.3098714780 0.5703232 0.6959115 0.4785758
0.1 0.7158433225 0.6267836 0.6504147 0.5257102
0.2 0.0003820240 0.5236133 0.7330705 0.4333380
0.2 0.0008825249 0.5230928 0.7335895 0.4327565
0.2 0.0020387471 0.5225364 0.7340826 0.4321162
0.2 0.0047097703 0.5220124 0.7344967 0.4314014
0.2 0.0108801806 0.5216392 0.7347328 0.4305806
0.2 0.0251346291 0.5233570 0.7330851 0.4323391
0.2 0.0580642545 0.5257075 0.7320202 0.4366554
0.2 0.1341359623 0.5400637 0.7225144 0.4520196
0.2 0.3098714780 0.5825931 0.6894484 0.4901949
0.2 0.7158433225 0.6468834 0.6542099 0.5420342
0.3 0.0003820240 0.5234028 0.7332712 0.4330782
0.3 0.0008825249 0.5229147 0.7337588 0.4325531
0.3 0.0020387471 0.5223650 0.7342351 0.4319473
0.3 0.0047097703 0.5217919 0.7346887 0.4311637
0.3 0.0108801806 0.5215840 0.7347496 0.4304884
0.3 0.0251346291 0.5230932 0.7334299 0.4328241
0.3 0.0580642545 0.5252148 0.7332064 0.4373919
0.3 0.1341359623 0.5446221 0.7196972 0.4572548
0.3 0.3098714780 0.5920870 0.6850715 0.4998493
0.3 0.7158433225 0.6704415 0.6602620 0.5592390
0.4 0.0003820240 0.5227559 0.7339039 0.4323624
0.4 0.0008825249 0.5225357 0.7341307 0.4321732
0.4 0.0020387471 0.5221812 0.7344077 0.4317626
0.4 0.0047097703 0.5216434 0.7348231 0.4309913
0.4 0.0108801806 0.5216175 0.7347027 0.4306022
0.4 0.0251346291 0.5226270 0.7339834 0.4330053
0.4 0.0580642545 0.5255304 0.7334863 0.4386092
0.4 0.1341359623 0.5486650 0.7172946 0.4613964
0.4 0.3098714780 0.5980475 0.6871362 0.5056990
0.4 0.7158433225 0.6931781 0.6751992 0.5805263
0.5 0.0003820240 0.5225069 0.7341692 0.4320918
0.5 0.0008825249 0.5223854 0.7342729 0.4320245
0.5 0.0020387471 0.5220775 0.7344933 0.4316556
0.5 0.0047097703 0.5215634 0.7348921 0.4308651
0.5 0.0108801806 0.5216570 0.7346623 0.4308962
0.5 0.0251346291 0.5221901 0.7344491 0.4332407
0.5 0.0580642545 0.5265059 0.7328969 0.4404169
0.5 0.1341359623 0.5550801 0.7117445 0.4676511
0.5 0.3098714780 0.6046765 0.6891083 0.5112450
0.5 0.7158433225 0.7216325 0.6842853 0.6065182
0.6 0.0003820240 0.5228011 0.7339215 0.4324977
0.6 0.0008825249 0.5224106 0.7342617 0.4320815
0.6 0.0020387471 0.5219702 0.7345905 0.4315486
0.6 0.0047097703 0.5214816 0.7349591 0.4307283
0.6 0.0108801806 0.5216594 0.7346453 0.4312456
0.6 0.0251346291 0.5218164 0.7348499 0.4336438
0.6 0.0580642545 0.5284605 0.7312139 0.4427067
0.6 0.1341359623 0.5630044 0.7037609 0.4744761
0.6 0.3098714780 0.6096292 0.6937591 0.5152154
0.6 0.7158433225 0.7546190 0.6918228 0.6344184
0.7 0.0003820240 0.5225943 0.7340715 0.4322544
0.7 0.0008825249 0.5223961 0.7342794 0.4320480
0.7 0.0020387471 0.5218804 0.7346772 0.4314297
0.7 0.0047097703 0.5214031 0.7350001 0.4305883
0.7 0.0108801806 0.5217405 0.7345737 0.4316623
0.7 0.0251346291 0.5215655 0.7351777 0.4339676
0.7 0.0580642545 0.5313261 0.7284626 0.4457845
0.7 0.1341359623 0.5681455 0.6986935 0.4790050
0.7 0.3098714780 0.6162603 0.6955693 0.5204536
0.7 0.7158433225 0.7881084 0.6953102 0.6610099
0.8 0.0003820240 0.5226445 0.7340408 0.4322954
0.8 0.0008825249 0.5223703 0.7343052 0.4320652
0.8 0.0020387471 0.5218030 0.7347453 0.4313461
0.8 0.0047097703 0.5213865 0.7350036 0.4305462
0.8 0.0108801806 0.5218877 0.7344386 0.4322037
0.8 0.0251346291 0.5216047 0.7352628 0.4344838
0.8 0.0580642545 0.5339401 0.7258578 0.4486395
0.8 0.1341359623 0.5708584 0.6968780 0.4813561
0.8 0.3098714780 0.6242583 0.6950890 0.5261636
0.8 0.7158433225 0.8211556 0.6957120 0.6881450
0.9 0.0003820240 0.5224549 0.7342159 0.4321450
0.9 0.0008825249 0.5223022 0.7343360 0.4319805
0.9 0.0020387471 0.5217390 0.7347943 0.4312610
0.9 0.0047097703 0.5213608 0.7350144 0.4305236
0.9 0.0108801806 0.5220283 0.7342824 0.4327218
0.9 0.0251346291 0.5218485 0.7351628 0.4351333
0.9 0.0580642545 0.5360819 0.7237029 0.4507902
0.9 0.1341359623 0.5731573 0.6955409 0.4829651
0.9 0.3098714780 0.6313123 0.6950041 0.5307282
0.9 0.7158433225 0.8599629 0.6957120 0.7207924
1.0 0.0003820240 0.5224778 0.7342351 0.4321632
1.0 0.0008825249 0.5222664 0.7343740 0.4319393
1.0 0.0020387471 0.5216845 0.7348274 0.4311726
1.0 0.0047097703 0.5212938 0.7350651 0.4304975
1.0 0.0108801806 0.5221174 0.7341735 0.4330826
1.0 0.0251346291 0.5224455 0.7346374 0.4360844
1.0 0.0580642545 0.5377594 0.7220514 0.4521799
1.0 0.1341359623 0.5747790 0.6947389 0.4838089
1.0 0.3098714780 0.6378602 0.6957120 0.5356386
1.0 0.7158433225 0.9060164 0.6957120 0.7594059
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.00470977.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion? Plot the object mod_EN.
# Type your code and comments inside the code chunk
# Viewing results from mod_EN and plotting mod_EN
plot(mod_EN)
RMSE was used to pick the best model; the selected tuning values are alpha = 1 and lambda = 0.00470977, giving an RMSE of 0.5212938.
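The resampling results for the chosen tuning combination can also be extracted directly; merging bestTune with the stored results table keeps only the selected row (a sketch using caret's stored objects):

```r
# Resampling results for the selected alpha and lambda
merge(mod_EN$bestTune, mod_EN$results)[, c("alpha", "lambda", "RMSE", "Rsquared", "MAE")]
```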
Compute the RMSE for mod_EN using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_EN <- RMSE(predict(mod_EN, testingTrans), testingTrans$brozek_C)
RMSE_EN
[1] 0.6246849
Fit a regression tree.
Use the train() function with method = "rpart", tuneLength = 10, and myControl as the trControl to fit a regression tree named mod_TR. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit Regression Tree
set.seed(42)
mod_TR <- train(brozek_C ~ . -age -abdomen_cm,
data = trainingTrans,
method = "rpart",
tuneLength = 10,
trControl = myControl)
Print mod_TR to the R console.
# Type your code and comments inside the code chunk
# Printing mod_TR
print(mod_TR)
CART
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.007703033 0.6636885 0.5851521 0.5433992
0.008433471 0.6625036 0.5849659 0.5434520
0.010116937 0.6542281 0.5939651 0.5376865
0.011548602 0.6543465 0.5930476 0.5386454
0.015615982 0.6604265 0.5864218 0.5462687
0.020312130 0.6588497 0.5808712 0.5451845
0.025576073 0.6477331 0.5946999 0.5348745
0.036213572 0.6592349 0.5789596 0.5380807
0.097234710 0.7291152 0.4830263 0.5844113
0.525027794 0.8741036 0.4114001 0.7219792
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.02557607.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_TR
RMSE was used to determine the best model; the selected complexity parameter is cp = 0.02557607, giving an RMSE of 0.6477331.
Use the rpart() function from the rpart package written by Therneau and Atkinson (2018) to build the regression tree using the complexity parameter (cp) value from mod_TR above. Name this tree mod_TR3.
# Type your code and comments inside the code chunk
# Building regression tree using rpart
library(rpart)
mod_TR3 <- rpart(brozek_C ~ . -age -abdomen_cm,
data = trainingTrans,
control = rpart.control(cp = mod_TR$bestTune$cp, xval = 50))
Use the plot() function from the partykit package written by Hothorn and Zeileis (2019) to graph mod_TR3.
# Type your code and comments inside the code chunk
# Plotting mod_TR3 with partykit
library(partykit)
plot(as.party(mod_TR3))
Use the rpart.plot() function from the rpart.plot package written by Milborrow (2018) to graph mod_TR3.
# Type your code and comments inside the code chunk
# Plotting mod_TR3 with rpart.plot
library(rpart.plot)
rpart.plot(mod_TR3)
What predictors are used in the graph of mod_TR3?
The predictor used in the graph of mod_TR3 is the variable abdomen_wrist.
Explain the tree.
The tree splits only on abdomen_wrist. Reading from the root, an observation with abdomen_wrist less than -0.3 and then less than -0.9 falls in the left-most terminal node, which holds about 20% of the training observations; the value displayed in that node is the predicted brozek_C (the mean body fat of the training observations in that node). At the other extreme, an observation with abdomen_wrist greater than -0.3, then greater than 0.77, and then greater than 1.8 falls in the right-most terminal node, which holds about 3% of the training observations and has the largest predicted body fat.
According to the tree, the abdomen_wrist measurements can be negative. Is this possible? If so, explain the reason for the negative values.
The raw abdomen_wrist variable equals abdomen_cm minus wrist_cm, which is always positive because the abdomen circumference is far larger than the wrist circumference. The negative values in the tree arise because the model was fit to trainingTrans, where every predictor was centered and scaled, so observations whose abdomen_wrist is below the training mean have negative standardized values.
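A quick comparison of the raw and transformed columns makes this clear (the raw variable comes from the untransformed training set created in Problem 3):

```r
# Raw differences are strictly positive; the transformed version is centered
# and scaled, so values below the training mean are negative
range(training$abdomen_wrist)
range(trainingTrans$abdomen_wrist)
```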
Compute the RMSE for mod_TR3 using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_TR <- RMSE(predict(mod_TR3, testingTrans), testingTrans$brozek_C)
RMSE_TR
[1] 0.7093008
Fit a Random Forest model.
Use the train() function with method = "ranger", tuneLength = 10, and myControl as the trControl to fit a random forest model named mod_RF. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit Random Forest model
set.seed(42)
mod_RF <- train(brozek_C ~ . -age -abdomen_cm,
data = trainingTrans,
method = "ranger",
tuneLength = 10,
trControl = myControl)
Print mod_RF to the R console.
# Type your code and comments inside the code chunk
# Printing mod_RF
print(mod_RF)
Random Forest
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 0.6033529 0.6553213 0.4990980
2 extratrees 0.6109769 0.6535320 0.5042730
3 variance 0.5921607 0.6657163 0.4922751
3 extratrees 0.5968306 0.6666102 0.4925443
4 variance 0.5845042 0.6728471 0.4863736
4 extratrees 0.5882696 0.6749139 0.4869029
6 variance 0.5774562 0.6787746 0.4826502
6 extratrees 0.5781233 0.6835980 0.4805376
7 variance 0.5759719 0.6800426 0.4812854
7 extratrees 0.5741989 0.6874072 0.4772021
9 variance 0.5732636 0.6821994 0.4774872
9 extratrees 0.5693255 0.6917845 0.4752408
10 variance 0.5719643 0.6832542 0.4774101
10 extratrees 0.5683003 0.6925570 0.4733990
12 variance 0.5723502 0.6822282 0.4790659
12 extratrees 0.5641502 0.6957053 0.4717005
13 variance 0.5710778 0.6833039 0.4770955
13 extratrees 0.5632266 0.6970135 0.4702682
15 variance 0.5707742 0.6833333 0.4768475
15 extratrees 0.5596436 0.6994116 0.4676028
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 15, splitrule =
extratrees and min.node.size = 5.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_RF
mod_RF$bestTune
mtry splitrule min.node.size
20 15 extratrees 5
RMSE was used to pick the best model; with mtry = 15, splitrule = extratrees, and min.node.size = 5, the best model has an RMSE of 0.5596436.
Use the RMSE() function along with the predict() function to find the root mean square error for the testing data.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_RF <- RMSE(predict(mod_RF, testingTrans), testingTrans$brozek_C)
RMSE_RF
[1] 0.5873907
Among the models created in Problem 6 through Problem 10 (mod_FS, mod_BE, mod_EN, mod_TR, and mod_RF), which do you think is best for predicting body fat and why?
```r
# Type your code and comments inside the code chunk
# Creating resamples list named mods
RMSE_all <- c(RMSE_FS, RMSE_BE, RMSE_EN, RMSE_TR, RMSE_RF)
RMSE_all
```
```
[1] 0.6117522 0.6117522 0.6246849 0.7093008 0.5873907
```
Among the models created in Problem 6 through Problem 10, the best model for predicting body fat appears to be the random forest model, mod_RF, because its test-set RMSE is the lowest of the five models.
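A named, sorted vector makes this comparison explicit (a small convenience sketch using the RMSE objects computed above):

```r
# Test-set RMSE by model, smallest (best) first
sort(c(FS = RMSE_FS, BE = RMSE_BE, EN = RMSE_EN, TR = RMSE_TR, RF = RMSE_RF))
```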
Many statistical algorithms work better on transformed variables; however, the user, whether a nurse, physical therapist, or physician, should be able to use your proposed model without resorting to a spreadsheet or calculator. Consequently, no transformation will take place in the models you fit in this question. Repeat Problem 6 through Problem 10 using the untransformed data in training and testing that you created in Problem 3. Make sure to give new names to the models that use the untransformed data.
Use the corrplot() function from the corrplot package written by Wei and Simko (2017) to identify predictors that may be linearly related in training.
# Type your code and comments inside the code chunk
# Identifying linearly related predictors Problem 6
cor <- cor(training)
corrplot(cor, method = "number")
cm <- cor(x = training$abdomen_cm, y=training$brozek_C)
wrist <- cor(x = training$abdomen_wrist, y=training$brozek_C)
Use the train() function with method = "leapForward", tuneLength = 10, and the object myControl assigned to the trControl argument to fit a forward selection model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_FS2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with forward stepwise selection
set.seed(42)
mod_FS2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "leapForward",
tuneLength = 10,
trControl = myControl)
Print mod_FS2 to the R console.
# Type your code and comments inside the code chunk
# Printing mod_FS2
print(mod_FS2)
Linear Regression with Forward Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 3.975510 0.7172364 3.328440
3 4.013375 0.7136037 3.350542
4 4.013338 0.7128407 3.356147
5 4.048652 0.7087094 3.369878
6 4.049642 0.7092520 3.365529
7 4.017277 0.7142455 3.346466
8 4.009138 0.7156374 3.330654
9 3.976430 0.7204939 3.292605
10 3.966893 0.7225691 3.290474
11 3.965347 0.7223120 3.284813
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 11.
Using the output in your console, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Isolating results from mod_FS2
mod_FS2$bestTune
nvmax
10 11
RMSE was used to pick the best submodel; with nvmax = 11 predictors, the best model has an RMSE of 3.965347.
Use the train() function with method = "leapBackward", tuneLength = 10, and the object myControl assigned to the trControl argument to fit a backward elimination model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_BE2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit model with backwards stepwise selection Problem 7
# with untransformed data
set.seed(42)
mod_BE2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "leapBackward",
tuneLength = 10,
trControl = myControl)
Print mod_BE2 to the R console.
# Type your code and comments inside the code chunk
# Printing mod_BE2
print(mod_BE2)
Linear Regression with Backwards Selection
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 4.002785 0.7160759 3.322765
3 3.940370 0.7245924 3.277038
4 4.024413 0.7137547 3.356219
5 3.975481 0.7210802 3.313074
6 3.997598 0.7180677 3.318327
7 4.024611 0.7153761 3.338106
8 4.017507 0.7168306 3.334540
9 4.008288 0.7181004 3.325982
10 3.996402 0.7191567 3.315898
11 3.985836 0.7195863 3.304527
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 3.
According to the output, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?
# Type your code and comments inside the code chunk
# Viewing results from mod_BE2
# Viewing final model
mod_BE2$bestTune
nvmax
2 3
RMSE was used to pick the best submodel; with nvmax = 3 predictors, the best model has an RMSE of 3.940370.
Compute the RMSE for mod_BE2 using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_BE2 <- RMSE(predict(mod_BE2, testing), testing$brozek_C)
RMSE_BE2
[1] 5.045508
Use the train() function with method = "glmnet" and tuneLength = 10 to fit a constrained linear regression model named mod_EN2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit constrained model Problem 8
# with untransformed data
set.seed(42)
mod_EN2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "glmnet",
tuneLength = 10,
trControl = myControl)
Print mod_EN2 to the R console.
# Type your code and comments inside the code chunk
# Printing mod_EN2
print(mod_EN2)
glmnet
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.1 0.002826336 3.898212 0.7306816 3.245262
0.1 0.006529204 3.897933 0.7307220 3.245058
0.1 0.015083308 3.895545 0.7311403 3.241223
0.1 0.034844398 3.891779 0.7317991 3.234050
0.1 0.080495081 3.884222 0.7329876 3.221861
0.1 0.185954082 3.889032 0.7324875 3.217446
0.1 0.429578060 3.933562 0.7273019 3.273984
0.1 0.992381059 4.021400 0.7185261 3.361966
0.1 2.292529014 4.258896 0.6919319 3.578695
0.1 5.296039497 4.668823 0.6471909 3.903705
0.2 0.002826336 3.903327 0.7301773 3.248850
0.2 0.006529204 3.902239 0.7303746 3.247602
0.2 0.015083308 3.898259 0.7309218 3.242272
0.2 0.034844398 3.890639 0.7320320 3.232161
0.2 0.080495081 3.881648 0.7333300 3.218666
0.2 0.185954082 3.891880 0.7321305 3.224696
0.2 0.429578060 3.928085 0.7285089 3.282415
0.2 0.992381059 4.040690 0.7178424 3.393632
0.2 2.292529014 4.349818 0.6849613 3.654502
0.2 5.296039497 4.813257 0.6520217 4.024312
0.3 0.002826336 3.906153 0.7298036 3.249005
0.3 0.006529204 3.904771 0.7300055 3.247757
0.3 0.015083308 3.898637 0.7308179 3.241826
0.3 0.034844398 3.887696 0.7324621 3.229256
0.3 0.080495081 3.879801 0.7335202 3.216224
0.3 0.185954082 3.896331 0.7315477 3.235838
0.3 0.429578060 3.928801 0.7289281 3.293221
0.3 0.992381059 4.066518 0.7159738 3.422237
0.3 2.292529014 4.400627 0.6838901 3.704633
0.3 5.296039497 4.983582 0.6586167 4.171876
0.4 0.002826336 3.903819 0.7301210 3.248358
0.4 0.006529204 3.903426 0.7301710 3.247816
0.4 0.015083308 3.898168 0.7309531 3.241826
0.4 0.034844398 3.885125 0.7328299 3.226894
0.4 0.080495081 3.879460 0.7335013 3.215697
0.4 0.185954082 3.899541 0.7311169 3.245998
0.4 0.429578060 3.934357 0.7286345 3.306616
0.4 0.992381059 4.092884 0.7138247 3.444701
0.4 2.292529014 4.444713 0.6859326 3.747502
0.4 5.296039497 5.149917 0.6736927 4.319777
0.5 0.002826336 3.903505 0.7302286 3.247733
0.5 0.006529204 3.902470 0.7303571 3.246765
0.5 0.015083308 3.897127 0.7311317 3.240569
0.5 0.034844398 3.883187 0.7330516 3.224895
0.5 0.080495081 3.880840 0.7332664 3.217132
0.5 0.185954082 3.903901 0.7305478 3.254614
0.5 0.429578060 3.945139 0.7274254 3.321369
0.5 0.992381059 4.139961 0.7080266 3.484672
0.5 2.292529014 4.492848 0.6880800 3.788181
0.5 5.296039497 5.359581 0.6829231 4.507267
0.6 0.002826336 3.903276 0.7302020 3.248133
0.6 0.006529204 3.900919 0.7304966 3.246190
0.6 0.015083308 3.896081 0.7312840 3.239674
0.6 0.034844398 3.882047 0.7331546 3.223683
0.6 0.080495081 3.883016 0.7329796 3.219828
0.6 0.185954082 3.907795 0.7300138 3.262730
0.6 0.429578060 3.956338 0.7260768 3.333083
0.6 0.992381059 4.193919 0.7005900 3.530124
0.6 2.292529014 4.529196 0.6924889 3.821134
0.6 5.296039497 5.601769 0.6908634 4.710635
0.7 0.002826336 3.904127 0.7299778 3.249239
0.7 0.006529204 3.901454 0.7303555 3.247018
0.7 0.015083308 3.892907 0.7317257 3.236682
0.7 0.034844398 3.880630 0.7333266 3.222263
0.7 0.080495081 3.885852 0.7325881 3.224287
0.7 0.185954082 3.910763 0.7295979 3.269390
0.7 0.429578060 3.969058 0.7243910 3.344746
0.7 0.992381059 4.222578 0.6970383 3.553660
0.7 2.292529014 4.579203 0.6938678 3.860845
0.7 5.296039497 5.843028 0.6940269 4.903381
0.8 0.002826336 3.906831 0.7294851 3.250907
0.8 0.006529204 3.903079 0.7301270 3.248279
0.8 0.015083308 3.890867 0.7320087 3.234609
0.8 0.034844398 3.879567 0.7334441 3.220952
0.8 0.080495081 3.888609 0.7321969 3.228650
0.8 0.185954082 3.912692 0.7292777 3.275092
0.8 0.429578060 3.982618 0.7224672 3.356549
0.8 0.992381059 4.242224 0.6953071 3.568179
0.8 2.292529014 4.637208 0.6933527 3.903227
0.8 5.296039497 6.086977 0.6942750 5.103765
0.9 0.002826336 3.907243 0.7295920 3.251813
0.9 0.006529204 3.904005 0.7300972 3.248767
0.9 0.015083308 3.888892 0.7322782 3.232773
0.9 0.034844398 3.879209 0.7334749 3.220375
0.9 0.080495081 3.891159 0.7318504 3.232807
0.9 0.185954082 3.914933 0.7289410 3.280292
0.9 0.429578060 3.994135 0.7208662 3.364936
0.9 0.992381059 4.258793 0.6939615 3.578897
0.9 2.292529014 4.684735 0.6939008 3.936921
0.9 5.296039497 6.373696 0.6942750 5.345038
1.0 0.002826336 3.907552 0.7295530 3.252845
1.0 0.006529204 3.905273 0.7299595 3.249799
1.0 0.015083308 3.887443 0.7324695 3.231457
1.0 0.034844398 3.879537 0.7334220 3.220381
1.0 0.080495081 3.892855 0.7316197 3.236196
1.0 0.185954082 3.916939 0.7286731 3.284792
1.0 0.429578060 4.005379 0.7194001 3.373151
1.0 0.992381059 4.268706 0.6933990 3.583624
1.0 2.292529014 4.734330 0.6942750 3.979685
1.0 5.296039497 6.713941 0.6942750 5.629564
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.9 and lambda
= 0.0348444.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion? Plot the object mod_EN2.
# Type your code and comments inside the code chunk
# Viewing results from mod_EN2
mod_EN2$bestTune
alpha lambda
84 0.9 0.0348444
RMSE was used to pick the best submodel; the selected tuning values are alpha = 0.9 and lambda = 0.0348444, giving an RMSE of 3.879209.
Compute the RMSE for mod_EN2 using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_EN2 <- RMSE(predict(mod_EN2, testing), testing$brozek_C)
RMSE_EN2
[1] 5.257586
Use the train() function with method = "rpart", tuneLength = 10, and myControl as the trControl to fit a regression tree named mod_TR2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit a regression tree Problem 9
# with untransformed data
set.seed(42)
mod_TR2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "rpart",
tuneLength = 10,
trControl = myControl)
Print mod_TR2 to the R console.
# Type your code and comments inside the code chunk
# Printing mod_TR2
print(mod_TR2)
CART
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.007703033 4.916433 0.5847639 4.025387
0.008433471 4.907655 0.5845893 4.025778
0.010116937 4.846352 0.5935986 3.983069
0.011548602 4.847237 0.5928790 3.990194
0.015615982 4.892275 0.5862508 4.046665
0.020312130 4.881305 0.5806739 4.040009
0.025576073 4.798975 0.5945769 3.963676
0.036213572 4.883539 0.5790766 3.986136
0.097234710 5.401180 0.4831269 4.329337
0.525027794 6.475054 0.4114001 5.348170
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.02557607.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_TR2
mod_TR2$bestTune
cp
7 0.02557607
RMSE was used to pick the best submodel; the best model has an RMSE of 4.798975 with a complexity parameter of cp = 0.02557607.
Use the rpart() function from the rpart package written by Therneau and Atkinson (2018) to build the regression tree using the complexity parameter (cp) value from mod_TR2 above. Name this tree mod_TR4.
# Type your code and comments inside the code chunk
# Building regression tree using rpart
mod_TR4 <- rpart(brozek_C ~ . -age -abdomen_cm,
data = training,
control = rpart.control(cp = mod_TR2$bestTune$cp, xval = 50))
Use the rpart.plot() function from the rpart.plot package written by Milborrow (2018) to graph mod_TR4.
# Type your code and comments inside the code chunk
# Plotting mod_TR4 with rpart.plot
rpart.plot(mod_TR4)
In the untransformed tree the splits are on the raw abdomen_wrist values in centimeters. For example, observations with abdomen_wrist greater than 71 but less than 81 fall in a terminal node containing about 37% of the training observations; the value displayed in that node is the predicted brozek_C for those observations, not the probability that the prediction is correct.
Compute the RMSE for mod_TR4 using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_TR4 <- RMSE(predict(mod_TR4, testing), testing$brozek_C)
RMSE_TR4
[1] 5.254253
Use the train() function with method = "ranger", tuneLength = 10, and myControl as the trControl to fit a random forest model named mod_RF2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.
# Type your code and comments inside the code chunk
# Fit a regression tree Problem 10
# with untransformed data
set.seed(42)
mod_RF2 <- train(brozek_C ~ . -age -abdomen_cm,
data = training,
method = "ranger",
tuneLength = 10,
trControl = myControl)
Print mod_RF2 to the R console.
# Type your code and comments inside the code chunk
# Printing mod_RF2
print(mod_RF2)
Random Forest
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 182, 181, 183, 184, 183, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 4.469335 0.6555022 3.693232
2 extratrees 4.527554 0.6534191 3.725801
3 variance 4.386340 0.6659372 3.644807
3 extratrees 4.428709 0.6652206 3.657601
4 variance 4.331055 0.6726188 3.603693
4 extratrees 4.360713 0.6738456 3.610469
6 variance 4.278248 0.6788437 3.577347
6 extratrees 4.271360 0.6848878 3.542221
7 variance 4.266490 0.6801720 3.567629
7 extratrees 4.255429 0.6870425 3.538343
9 variance 4.248479 0.6817219 3.538063
9 extratrees 4.211828 0.6922616 3.510987
10 variance 4.243011 0.6824179 3.541788
10 extratrees 4.206633 0.6931284 3.505200
12 variance 4.245644 0.6815463 3.553926
12 extratrees 4.176676 0.6964050 3.490920
13 variance 4.235464 0.6828409 3.540009
13 extratrees 4.160651 0.6978126 3.473362
15 variance 4.233937 0.6826609 3.538610
15 extratrees 4.146548 0.6997243 3.469731
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 15, splitrule =
extratrees and min.node.size = 5.
According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?
# Type your code and comments inside the code chunk
# Viewing results from mod_RF2
mod_RF2$bestTune
mtry splitrule min.node.size
20 15 extratrees 5
The criterion used was RMSE; with mtry = 15, splitrule = extratrees, and min.node.size = 5, the best model has an RMSE of 4.146548.
Use the RMSE() function along with the predict() function to find the root mean square error for the testing data.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
RMSE_RF2 <- RMSE(predict(mod_RF2, testing), testing$brozek_C)
RMSE_RF2
[1] 4.339785
Which model does the best job of predicting body fat?
# Type your code and comments inside the code chunk
# Creating resamples list of different models
#model_list <- list(item1 = mod_FS2, item2 = mod_BE2, item3 = mod_EN2, item4 = mod_TR2, item5 = mod_TR3, item6 = mod_TR4, item7 = mod_RF2)
#ans <- resamples(model_list)
#summary(ans)
model_list2 <- list(item1 = mod_FS2, item2 = mod_BE2, item3 = mod_EN2, item4 = mod_TR2, item7 = mod_RF2)
ans2 <- resamples(model_list2)
summary(ans2)
Call:
summary.resamples(object = ans2)
Models: item1, item2, item3, item4, item7
Number of resamples: 50
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
item1 2.194046 3.021355 3.218821 3.284813 3.675648 4.330898 0
item2 2.397646 3.037138 3.233365 3.277038 3.554873 4.475806 0
item3 2.311599 2.953437 3.211983 3.220375 3.503270 4.264038 0
item4 2.963476 3.495434 3.918600 3.963676 4.369686 5.332065 0
item7 2.719000 3.148048 3.316040 3.469731 3.843903 4.287787 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
item1 2.728359 3.666489 3.928735 3.965347 4.304002 5.284396 0
item2 2.825743 3.728220 3.933744 3.940370 4.278550 5.035836 0
item3 2.772695 3.643004 3.868425 3.879209 4.141040 4.978459 0
item4 3.460347 4.340325 4.726587 4.798975 5.187490 6.432799 0
item7 3.307274 3.738256 4.034858 4.146548 4.578877 5.323989 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
item1 0.5297948 0.6551675 0.7351440 0.7223120 0.7755229 0.9152867 0
item2 0.5602622 0.6796893 0.7265992 0.7245924 0.7725706 0.9016342 0
item3 0.5478564 0.6854955 0.7395248 0.7334749 0.7816228 0.9013204 0
item4 0.3180474 0.5361211 0.6107806 0.5945769 0.6736918 0.7866000 0
item7 0.5014774 0.6446490 0.6992479 0.6997243 0.7556276 0.8463160 0
Based on the resampling summary, the elastic net model mod_EN2 (item3) does the best job of predicting body fat, with the lowest mean resampled RMSE (3.879209) and the highest mean R-squared (0.7334749); on the testing set, however, the random forest model mod_RF2 has the smallest RMSE (4.339785).
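For a test-set comparison to complement the resampling summary, the same RMSE() approach used earlier can be applied to every untransformed model (a sketch; RMSE_FS2 is computed here because it was not needed earlier):

```r
# Test-set RMSE for the untransformed models, smallest (best) first
RMSE_FS2 <- RMSE(predict(mod_FS2, testing), testing$brozek_C)
sort(c(FS2 = RMSE_FS2, BE2 = RMSE_BE2, EN2 = RMSE_EN2, TR4 = RMSE_TR4, RF2 = RMSE_RF2))
```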
Which model is the most practical model for someone who needs to rapidly assess a patient’s body fat?
The regression tree is probably the most practical model for someone who needs to rapidly assess a patient's body fat because it is the easiest to read and apply: the user follows a few simple comparisons on a handful of measurements, with no calculator required. For someone unfamiliar with these different models, a tree is also the easiest to understand. Since the random forest model takes the longest to fit and cannot be applied without software, it would not be the most practical model for rapid assessment.
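As an illustration of how easy the raw-data tree is to apply, it can score a new patient with a single call (here the first row of testing stands in for a new patient):

```r
# Predict body fat for a "new" patient with the untransformed regression tree
predict(mod_TR4, newdata = testing[1, ])
```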
Dowle, Matt, and Arun Srinivasan. 2019. Data.table: Extension of 'Data.frame'. https://CRAN.R-project.org/package=data.table.
Hothorn, Torsten, and Achim Zeileis. 2019. Partykit: A Toolkit for Recursive Partytioning. https://CRAN.R-project.org/package=partykit.
Johnson, Roger W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics Education 4 (1). doi:10.1080/10691898.1996.11910505.
Milborrow, Stephen. 2018. Rpart.plot: Plot ’Rpart’ Models: An Enhanced Version of ’Plot.rpart’. https://CRAN.R-project.org/package=rpart.plot.
Therneau, Terry, and Beth Atkinson. 2018. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.
Wei, Taiyun, and Viliam Simko. 2017. Corrplot: Visualization of a Correlation Matrix. https://CRAN.R-project.org/package=corrplot.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.