This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt

Follow all directions. Type complete sentences to answer all questions inside the answer tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R code to report numeric answers inside the answer tags (i.e. do not hard code your numeric answers).

The article by Johnson (1996) defines body fat determined with the Siri and Brozek methods, as well as fat-free weight, using equations (1), (2), and (3), respectively.

\[\begin{equation} \text{bodyfatSiri} = \frac{495}{\text{density}} - 450 \tag{1} \end{equation}\] \[\begin{equation} \text{bodyfatBrozek} = \frac{457}{\text{density}} - 414.2 \tag{2} \end{equation}\] \[\begin{equation} \text{FatFreeWeight} = \left(1 -\frac{\text{brozek}}{100}\right)\times \text{weight\_lbs} \tag{3} \end{equation}\]

Body Mass Index (BMI) is defined as

\[\text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2}\] Please use the following conversion factors with this project: 0.453592 kilograms per pound and 2.54 centimeters per inch.
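For example, using the weight and height that appear as the first row of the cleaned data later in this project (154.25 lbs, 67.75 in), a minimal helper illustrating the definition above (the function name is ours, not part of the assignment):

```r
# Illustration only: BMI from pounds and inches using the stated factors
bmi_from_lbs_in <- function(weight_lbs, height_in) {
  weight_kg <- weight_lbs * 0.453592   # pounds to kilograms
  height_m  <- height_in * 2.54 / 100  # inches to meters
  weight_kg / height_m^2
}
bmi_from_lbs_in(154.25, 67.75)         # roughly 23.6
```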

  1. Use the original data from http://jse.amstat.org/datasets/fat.dat.txt and evaluate the quality of the data. Specifically, start by using the fread() function from the data.table package written by Dowle and Srinivasan (2019) to read the data from http://jse.amstat.org/datasets/fat.dat.txt into an object named bodyfat. Pass the following vector of names to the col.names argument of fread(): c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm", "forearm_cm", "wrist_cm")

    ```r
    # Type your code and comments inside the code chunk
    # Obtaining the original data
    library(data.table)
    names <- c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm", "forearm_cm", "wrist_cm")
    bodyfat <- fread("http://jse.amstat.org/datasets/fat.dat.txt", col.names = names)
    ```
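    A quick sanity check, assuming the file arrives with the 252 cases and 19 columns documented by Johnson (1996):

    ```r
    # Expect 252 rows and 19 columns per the dataset documentation
    dim(bodyfat)
    ```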
    • Create plotly interactive scatterplots of brozek versus density, weight_lbs versus height_in, and ankle_cm versus weight_lbs, each with case mapped to color, to help identify potential outliers. How many values do you think are potentially data entry errors? Explain your reasoning and show the code you used to identify the errors.

      # Type your code and comments inside the code chunk
      # Creating interactive scatterplot of brozek versus density
      
      library(plotly)
      p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
                  color = case)) +
                  geom_point() +
                  theme_bw()
      g <- ggplotly(p)
      g

      Figure 1: Plot of brozek versus density

      bodyfat[c(48, 76, 96, 42, 182, 216),
        c("density", "brozek", "siri", "height_in", "weight_lbs")]
         density brozek siri height_in weight_lbs
      1:  1.0665    6.4  5.6     71.25     148.50
      2:  1.0666   18.3 18.5     67.50     148.25
      3:  1.0991   17.3 17.4     77.75     224.50
      4:  1.0250   31.7 32.9     29.50     205.00
      5:  1.1089    0.0  0.0     68.00     118.50
      6:  0.9950   45.1 47.5     64.00     219.00
      library(dplyr)  # for mutate() and filter()
      bodyfat <- bodyfat %>% mutate(brozek_eq = (495 / density) - 450)
      
      bodyfat <- bodyfat %>% mutate(broDiff = brozek - brozek_eq)
      
      bodyfat %>%
        filter(abs(broDiff) > 2)
        case brozek siri density age weight_lbs height_in  bmi fat_free_weight
      1   48    6.4  5.6  1.0665  39     148.50     71.25 20.6           139.0
      2   76   18.3 18.5  1.0666  61     148.25     67.50 22.9           121.1
      3   96   17.3 17.4  1.0991  53     224.50     77.75 26.1           185.7
      4  182    0.0  0.0  1.1089  40     118.50     68.00 18.1           118.5
      5  216   45.1 47.5  0.9950  51     219.00     64.00 37.6           120.2
        neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm ankle_cm biceps_cm
      1    34.6     89.8       79.5   92.7     52.7    37.5     21.9      28.8
      2    36.0     91.6       81.8   94.8     54.5    37.0     21.4      29.3
      3    41.1    113.2       99.2  107.5     61.7    42.3     23.2      32.9
      4    33.8     79.3       69.4   85.0     47.2    33.5     20.2      27.7
      5    41.2    119.8      122.1  112.8     62.5    36.9     23.6      34.7
        forearm_cm wrist_cm  brozek_eq   broDiff
      1       26.8     17.9 14.1350211 -7.735021
      2       27.0     18.3 14.0915057  4.208494
      3       30.8     20.4  0.3684833 16.931517
      4       24.6     16.5 -3.6116873  3.611687
      5       29.1     18.4 47.4874372 -2.387437

      The potential data entry errors are the cases that lie off the line; since the brozek formula is a linear transformation of density, every correctly recorded case should fall exactly on it. These outliers include cases 48, 76, 96, and 182, and case 216 is most likely a data entry error as well. We reached these conclusions by creating a new variable, brozek_eq, which computes what brozek should be from density, and then flagging the cases where the absolute difference between brozek and brozek_eq exceeds 2. This is shown in the code above.

      plot_ly(data = bodyfat, x = ~brozek, y = ~density,
              marker = list(size = 5,
                     color = ~case,
                     line = list(color = ~case,
                                 width = 1)))
      # Type your code and comments inside the code chunk
      # Creating interactive scatterplot of weight_lbs versus height_in
          p <- ggplot(data = bodyfat, aes(x = weight_lbs, y = height_in,
          color = case)) +
          geom_point() +
          theme_bw()
      
      g <- ggplotly(p)
      g

      Figure 2: Plot of weight_lbs versus height_in

      There are a few outliers in this graph; however, there appears to be only one data entry error, case 42. A height of 29.5 inches with a weight of 205 lbs is practically impossible. The rest of the data are possible combinations of height and weight, and we expect some variability in this relationship since neither variable is computed directly from the other.

      plot_ly(data = bodyfat, x = ~weight_lbs, y = ~height_in,
              marker = list(size = 5,
                     color = ~case,
                     line = list(color = ~case,
                                 width = 1)))
      # Type your code and comments inside the code chunk
      # Isolating points of interest
      
      # Points of interest for brozek vs. density graph
      bodyfat[c(48, 76, 96, 42, 182, 216),
        c("density", "brozek", "siri", "height_in", "weight_lbs" )]
          density brozek siri height_in weight_lbs
      48   1.0665    6.4  5.6     71.25     148.50
      76   1.0666   18.3 18.5     67.50     148.25
      96   1.0991   17.3 17.4     77.75     224.50
      42   1.0250   31.7 32.9     29.50     205.00
      182  1.1089    0.0  0.0     68.00     118.50
      216  0.9950   45.1 47.5     64.00     219.00
      # Points of interest for height_in vs. weight_lbs graph
      bodyfat[c(42),
        c("density", "brozek", "siri", "height_in", "weight_lbs" )]
         density brozek siri height_in weight_lbs
      42   1.025   31.7 32.9      29.5        205
      # Type your code and comments inside the code chunk
      # Replacing identified typos of density and height_in
      
      
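      The replacement code itself is not shown; a minimal sketch of the intended kind of fix, assuming (as Johnson (1996) suggests) that the height of 29.5 inches recorded for case 42 is a typo for 69.5 inches, the value consistent with his recorded BMI of 29.9:

      ```r
      # Assumed correction: 29.5 in is implausible for a 205 lb man;
      # 69.5 in matches the BMI of 29.9 recorded for case 42
      bodyfat$height_in[42] <- 69.5
      ```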
      p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
                  color = case)) +
                  geom_point() +
                  theme_bw()
      g <- ggplotly(p)
      g
      # Updating computed bodyfat values and bmi measurements
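      A sketch of how BMI could be recomputed once height_in is corrected; the other derived columns would be refreshed the same way from equations (1)-(3):

      ```r
      # Recompute BMI from weight and the corrected height, using the
      # stated conversion factors
      bodyfat <- bodyfat %>%
        mutate(bmi = round((weight_lbs * 0.453592) /
                             ((height_in * 2.54) / 100)^2, 1))
      ```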
      # Type your code and comments inside the code chunk
      # Creating interactive scatterplot of ankle_cm versus weight_lbs
      
      p <- ggplot(data = bodyfat, aes(x = ankle_cm, y = weight_lbs,
          color = case)) +
          geom_point() +
          theme_bw()
      
      g <- ggplotly(p)
      g

      Figure 3: Interactive scatterplot of ankle_cm versus weight_lbs

      It looks like cases 31 and 86 could be data entry errors, while case 39 (top) is likely just an outlier rather than an entry error. Cases 31 and 86 show abnormally large ankle circumferences at average weights; the entries should probably be 23.9 (case 31) and 23.7 (case 86).

      # Type your code and comments inside the code chunk
      # Creating interactive scatterplot of ankle_cm versus weight_lbs
      
      plot_ly(data = bodyfat, x = ~ankle_cm, y = ~weight_lbs,
              marker = list(size = 5,
                     color = ~case,
                     line = list(color = ~case,
                                 width = 1)))

      Figure 4: Interactive scatterplot of ankle_cm versus weight_lbs (plot_ly version)

      # Type your code and comments inside the code chunk
      # Replacing identified typos in ankle_cm
      
      bodyfat$ankle_cm[31] <- 23.9 
      bodyfat$ankle_cm[86] <- 23.7
      
      p <- ggplot(data = bodyfat, aes(x = ankle_cm, y = weight_lbs,
                  color = case)) +
                  geom_point() +
                  theme_bw()
      
      g <- ggplotly(p)
      g
      # Type your code and comments inside the code chunk
      # Identifying bodyfat typos for brozek and siri
      
       p <- ggplot(data = bodyfat, aes(x = brozek, y = siri,
          color = case)) +
          geom_point() +
          theme_bw()
      
      g <- ggplotly(p)
      g
      # Type your code and comments inside the code chunk
      # Number of rounding discrepancies for siri
      
      bodyfat <- bodyfat %>%
        mutate(siri_eq = round((457 / density - 414.2), 1))
      
      sum(bodyfat$siri != bodyfat$siri_eq)
      [1] 242
      # Number of rounding discrepancies for brozek
      
      sum(bodyfat$brozek != bodyfat$brozek_eq)
      [1] 252
      # Number of rounding discrepancies for bmi
      
      height_m <- (bodyfat$height_in * 2.54) / 100
      
      bodyfat <- bodyfat %>%
        mutate(bmi_eq = round((weight_lbs * 0.453592) / (height_m^2), 1))
      
      sum(bodyfat$bmi != bodyfat$bmi_eq)
      [1] 99

      Case 182 is a typo because a person cannot have 0% body fat. Case 169 is a possible typo because it does not fall on the line.

      Both of these are most likely rounding artifacts. In fact, case 182's density of 1.1089 implies a negative body fat under either equation (about −2.1 from Brozek and −3.6 from Siri), so the recorded 0.0 is almost certainly a negative computed value truncated to zero.
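      As a quick arithmetic check, the density recorded for case 182 implies a negative body fat under either formula:

      ```r
      457 / 1.1089 - 414.2  # Brozek: about -2.08
      495 / 1.1089 - 450    # Siri:   about -3.61
      ```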


  1. Make the clean data accessible to R.

    • Load the file bodyfatClean.csv from https://github.com/alanarnholt/MISCD into your R session. Specifically, use the read.csv() function to load the file bodyfatClean.csv into your current R session, naming the object cleaned_bf. Since GitHub renders the file as HTML, click the Raw button to obtain the *.csv file.

      # Type your code and comments inside the code chunk
      # Read in clean data
      library(dplyr)
      cleaned_bf <- read.csv("https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv")
    • Use the glimpse() function from the dplyr package written by Wickham et al. (2019) to view the structure of cleaned_bf.

      # Type your code and comments inside the code chunk
      # Examining the object cleaned_bf
      glimpse(cleaned_bf)
      Observations: 251
      Variables: 18
      $ age           <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32…
      $ weight_lbs    <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 18…
      $ height_in     <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 7…
      $ neck_cm       <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38…
      $ chest_cm      <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6,…
      $ abdomen_cm    <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 8…
      $ hip_cm        <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1…
      $ thigh_cm      <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62…
      $ knee_cm       <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38…
      $ ankle_cm      <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23…
      $ biceps_cm     <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35…
      $ forearm_cm    <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31…
      $ wrist_cm      <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18…
      $ brozek_C      <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1…
      $ bmi_C         <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24…
      $ age_sq        <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, …
      $ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64…
      $ am            <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, …

  1. Partition the data.

    • Use the createDataPartition() function from the caret package to partition the data into training and testing sets. Use 80% of the data for training and 20% for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable you want to use is brozek_C (the computed brozek based on the reported density).

      # Type your code and comments inside the code chunk
      # Partitioning the data
      library(caret)
      set.seed(314)
      in_train <- createDataPartition(cleaned_bf$brozek_C, p = 0.8, list = FALSE)
      training <- cleaned_bf[in_train, ]
      testing <- cleaned_bf[-in_train, ]
    • Use the dim() function to verify the sizes of training and testing data sets.

      # Type your code and comments inside the code chunk
      # Verifying dimensions of training and testing
      dim(training)
      [1] 203  18
      dim(testing)
      [1] 48 18

      There are 203 observations and 18 variables in the training dataset, and 48 observations and 18 variables in the testing dataset.


  1. Transform the data.

    • Use the preProcess() function to transform the predictors that are in the training data set. Specifically, pass a vector with "center", "scale", and "BoxCox" to the method argument of preProcess(). Make sure not to transform the response (brozek_C).

      # Type your code and comments inside the code chunk
      # Transforming the data
      training_pp <- preProcess(training, method = c("center", "scale", "BoxCox"))
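      Note that passing the entire training frame transforms every numeric column, including the response. A sketch of one way to honor the requirement that brozek_C stay untransformed, assuming predict() passes through columns the preProcess object was not trained on:

      ```r
      # Fit the transformations on the predictors only so that brozek_C
      # passes through predict() unchanged
      pred_cols <- setdiff(names(training), "brozek_C")
      pp_predictors <- preProcess(training[, pred_cols],
                                  method = c("center", "scale", "BoxCox"))
      ```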
    • Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.

      # Type your code and comments inside the code chunk
      # Creating trainingTrans and testingTrans
      trainingTrans <- predict(training_pp, training)
      testingTrans <- predict(training_pp, testing)

  2. Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.

    ```r
    # Type your code and comments inside the code chunk
    # Define the type of resampling
    myControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
    ```

  1. Fit a linear regression model using forward stepwise selection.

    • Use the corrplot() function from the corrplot package written by Wei and Simko (2017) to identify predictors that may be linearly related in trainingTrans. Are any of the variables collinear? If so, remove the predictor that is least correlated to the response variable. Note that when method = "number" is used with corrplot(), color coded numerical correlations are displayed.

      # Type your code and comments inside the code chunk
      # Identifying linearly related predictors
      library(corrplot)
      cor_mat <- cor(trainingTrans)
      corrplot(cor_mat, method = "number")

      cm <- cor(x = trainingTrans$abdomen_cm, y = trainingTrans$brozek_C)
      wrist <- cor(x = trainingTrans$abdomen_wrist, y = trainingTrans$brozek_C)

      Age and age_sq are collinear, as are abdomen_wrist and abdomen_cm. Comparing the stored correlations with brozek_C, we drop age and abdomen_cm, the member of each pair that is less correlated with the response, in the model fits below.

    • Use the train() function with method = "leapForward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a forward selection model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_FS. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit model with forward stepwise selection
      set.seed(42)
      mod_FS <- train(brozek_C ~ . -age -abdomen_cm,
                     trainingTrans,
                     method = "leapForward",
                     tuneLength = 10,
                     trControl = myControl)
    • Print mod_FS to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_FS
      print(mod_FS)
      Linear Regression with Forward Selection 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        nvmax  RMSE       Rsquared   MAE      
         2     0.5506026  0.7033709  0.4578342
         3     0.5383992  0.7165354  0.4511958
         4     0.5334791  0.7221570  0.4462905
         5     0.5356176  0.7216630  0.4476914
         6     0.5343806  0.7226373  0.4450996
         7     0.5335540  0.7249148  0.4429665
         8     0.5278222  0.7309489  0.4384376
         9     0.5287631  0.7292929  0.4399173
        10     0.5294384  0.7279965  0.4399193
        11     0.5280666  0.7291664  0.4398026
      
      RMSE was used to select the optimal model using the smallest value.
      The final value used for the model was nvmax = 8.
    • Using the output in your console, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?

      # Type your code and comments inside the code chunk
      # Isolating results from mod_FS
      mod_FS$bestTune
        nvmax
      7     8

      RMSE was used as the criterion; the best submodel has nvmax = 8 predictors and an RMSE of 0.5278222.

    • Use the summary() function to find out which predictors are selected as the final submodel.

      # Type your code and comments inside the code chunk
      # Viewing final model
      summary(mod_FS)
      Subset selection object
      15 Variables  (and intercept)
                    Forced in Forced out
      weight_lbs        FALSE      FALSE
      height_in         FALSE      FALSE
      neck_cm           FALSE      FALSE
      chest_cm          FALSE      FALSE
      hip_cm            FALSE      FALSE
      thigh_cm          FALSE      FALSE
      knee_cm           FALSE      FALSE
      ankle_cm          FALSE      FALSE
      biceps_cm         FALSE      FALSE
      forearm_cm        FALSE      FALSE
      wrist_cm          FALSE      FALSE
      bmi_C             FALSE      FALSE
      age_sq            FALSE      FALSE
      abdomen_wrist     FALSE      FALSE
      am                FALSE      FALSE
      1 subsets of each size up to 8
      Selection Algorithm: forward
               weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
      1  ( 1 ) " "        " "       " "     " "      " "    " "      " "    
      2  ( 1 ) "*"        " "       " "     " "      " "    " "      " "    
      3  ( 1 ) "*"        " "       " "     " "      " "    " "      " "    
      4  ( 1 ) "*"        " "       " "     " "      " "    " "      " "    
      5  ( 1 ) "*"        " "       " "     " "      " "    " "      " "    
      6  ( 1 ) "*"        " "       "*"     " "      " "    " "      " "    
      7  ( 1 ) "*"        " "       "*"     "*"      " "    " "      " "    
      8  ( 1 ) "*"        " "       "*"     "*"      " "    " "      " "    
               ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist
      1  ( 1 ) " "      " "       " "        " "      " "   " "    "*"          
      2  ( 1 ) " "      " "       " "        " "      " "   " "    "*"          
      3  ( 1 ) " "      " "       " "        "*"      " "   " "    "*"          
      4  ( 1 ) " "      " "       " "        "*"      " "   "*"    "*"          
      5  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      6  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      7  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      8  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
               am 
      1  ( 1 ) " "
      2  ( 1 ) " "
      3  ( 1 ) " "
      4  ( 1 ) " "
      5  ( 1 ) " "
      6  ( 1 ) " "
      7  ( 1 ) " "
      8  ( 1 ) "*"

      The predictors selected in the final eight-variable submodel are weight_lbs, neck_cm, chest_cm, wrist_cm, bmi_C, age_sq, abdomen_wrist, and am.

    • Compute the RMSE for mod_FS using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_FS <- RMSE(predict(mod_FS, testingTrans), testingTrans$brozek_C)
      RMSE_FS
      [1] 0.6117522
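      For reference, caret's RMSE() is simply the square root of the mean squared prediction error; computed by hand:

      ```r
      # Manual equivalent of RMSE() on the testing set
      pred <- predict(mod_FS, newdata = testingTrans)
      sqrt(mean((pred - testingTrans$brozek_C)^2))
      ```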

  1. Fit a linear regression model using backward stepwise selection.

    • Use the train() function with method = "leapBackward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a backward elimination model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_BE. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit model with backwards stepwise selection
      set.seed(42)
      mod_BE <- train(brozek_C ~ . -age -abdomen_cm, 
                      trainingTrans, 
                      method = "leapBackward", 
                      tuneLength = 10, 
                      trControl = myControl)
    • Print mod_BE to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_BE
      print(mod_BE)
      Linear Regression with Backwards Selection 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        nvmax  RMSE       Rsquared   MAE      
         2     0.5331715  0.7223546  0.4437283
         3     0.5262643  0.7294633  0.4357332
         4     0.5330154  0.7238807  0.4422245
         5     0.5304839  0.7267507  0.4397319
         6     0.5305875  0.7264251  0.4403598
         7     0.5307921  0.7270677  0.4404149
         8     0.5246361  0.7332446  0.4357663
         9     0.5286987  0.7294381  0.4404245
        10     0.5283786  0.7299483  0.4397340
        11     0.5279656  0.7304313  0.4402907
      
      RMSE was used to select the optimal model using the smallest value.
      The final value used for the model was nvmax = 8.
    • According to the output, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?

      # Type your code and comments inside the code chunk
      # Viewing results from mod_BE
      mod_BE$bestTune
        nvmax
      7     8

      The criterion for picking the best submodel is RMSE; the best model uses 8 predictor variables and has an RMSE of 0.5246361.

    • Use the summary() function to find out which predictors are selected as the final submodel.

      # Type your code and comments inside the code chunk
      # Viewing final model
      summary(mod_BE)
      Subset selection object
      15 Variables  (and intercept)
                    Forced in Forced out
      weight_lbs        FALSE      FALSE
      height_in         FALSE      FALSE
      neck_cm           FALSE      FALSE
      chest_cm          FALSE      FALSE
      hip_cm            FALSE      FALSE
      thigh_cm          FALSE      FALSE
      knee_cm           FALSE      FALSE
      ankle_cm          FALSE      FALSE
      biceps_cm         FALSE      FALSE
      forearm_cm        FALSE      FALSE
      wrist_cm          FALSE      FALSE
      bmi_C             FALSE      FALSE
      age_sq            FALSE      FALSE
      abdomen_wrist     FALSE      FALSE
      am                FALSE      FALSE
      1 subsets of each size up to 8
      Selection Algorithm: backward
               weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
      1  ( 1 ) " "        " "       " "     " "      " "    " "      " "    
      2  ( 1 ) " "        " "       " "     " "      " "    " "      " "    
      3  ( 1 ) " "        " "       " "     " "      " "    " "      " "    
      4  ( 1 ) " "        " "       " "     " "      " "    " "      " "    
      5  ( 1 ) " "        " "       "*"     " "      " "    " "      " "    
      6  ( 1 ) " "        " "       "*"     "*"      " "    " "      " "    
      7  ( 1 ) " "        " "       "*"     "*"      " "    " "      " "    
      8  ( 1 ) "*"        " "       "*"     "*"      " "    " "      " "    
               ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist
      1  ( 1 ) " "      " "       " "        " "      " "   " "    "*"          
      2  ( 1 ) " "      " "       " "        "*"      " "   " "    "*"          
      3  ( 1 ) " "      " "       " "        "*"      " "   "*"    "*"          
      4  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      5  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      6  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      7  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
      8  ( 1 ) " "      " "       " "        "*"      "*"   "*"    "*"          
               am 
      1  ( 1 ) " "
      2  ( 1 ) " "
      3  ( 1 ) " "
      4  ( 1 ) " "
      5  ( 1 ) " "
      6  ( 1 ) " "
      7  ( 1 ) "*"
      8  ( 1 ) "*"

      The predictors selected in the final eight-variable submodel are weight_lbs, neck_cm, chest_cm, wrist_cm, bmi_C, age_sq, abdomen_wrist, and am.

    • Compute the RMSE for mod_BE using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_BE <- RMSE(predict(mod_BE, testingTrans), testingTrans$brozek_C)
      RMSE_BE
      [1] 0.6117522

  1. Fit a constrained linear regression model.

    • Use the train() function with method = "glmnet" and tuneLength = 10 to fit a constrained linear regression model named mod_EN. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit constrained model (elastic net)
      set.seed(42)
      mod_EN <- train(brozek_C ~ . -age_sq -abdomen_cm, 
                      data = trainingTrans, 
                      method = "glmnet", 
                      tuneLength = 10, 
                      trControl = myControl)
    • Print mod_EN to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_EN
      print(mod_EN)
      glmnet 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        alpha  lambda        RMSE       Rsquared   MAE      
        0.1    0.0003820240  0.5231616  0.7334311  0.4329046
        0.1    0.0008825249  0.5231413  0.7334561  0.4328746
        0.1    0.0020387471  0.5228081  0.7338136  0.4323957
        0.1    0.0047097703  0.5223652  0.7341611  0.4317239
        0.1    0.0108801806  0.5219756  0.7344251  0.4309068
        0.1    0.0251346291  0.5234142  0.7329996  0.4317356
        0.1    0.0580642545  0.5272526  0.7298459  0.4365018
        0.1    0.1341359623  0.5378889  0.7224132  0.4476435
        0.1    0.3098714780  0.5703232  0.6959115  0.4785758
        0.1    0.7158433225  0.6267836  0.6504147  0.5257102
        0.2    0.0003820240  0.5236133  0.7330705  0.4333380
        0.2    0.0008825249  0.5230928  0.7335895  0.4327565
        0.2    0.0020387471  0.5225364  0.7340826  0.4321162
        0.2    0.0047097703  0.5220124  0.7344967  0.4314014
        0.2    0.0108801806  0.5216392  0.7347328  0.4305806
        0.2    0.0251346291  0.5233570  0.7330851  0.4323391
        0.2    0.0580642545  0.5257075  0.7320202  0.4366554
        0.2    0.1341359623  0.5400637  0.7225144  0.4520196
        0.2    0.3098714780  0.5825931  0.6894484  0.4901949
        0.2    0.7158433225  0.6468834  0.6542099  0.5420342
        0.3    0.0003820240  0.5234028  0.7332712  0.4330782
        0.3    0.0008825249  0.5229147  0.7337588  0.4325531
        0.3    0.0020387471  0.5223650  0.7342351  0.4319473
        0.3    0.0047097703  0.5217919  0.7346887  0.4311637
        0.3    0.0108801806  0.5215840  0.7347496  0.4304884
        0.3    0.0251346291  0.5230932  0.7334299  0.4328241
        0.3    0.0580642545  0.5252148  0.7332064  0.4373919
        0.3    0.1341359623  0.5446221  0.7196972  0.4572548
        0.3    0.3098714780  0.5920870  0.6850715  0.4998493
        0.3    0.7158433225  0.6704415  0.6602620  0.5592390
        0.4    0.0003820240  0.5227559  0.7339039  0.4323624
        0.4    0.0008825249  0.5225357  0.7341307  0.4321732
        0.4    0.0020387471  0.5221812  0.7344077  0.4317626
        0.4    0.0047097703  0.5216434  0.7348231  0.4309913
        0.4    0.0108801806  0.5216175  0.7347027  0.4306022
        0.4    0.0251346291  0.5226270  0.7339834  0.4330053
        0.4    0.0580642545  0.5255304  0.7334863  0.4386092
        0.4    0.1341359623  0.5486650  0.7172946  0.4613964
        0.4    0.3098714780  0.5980475  0.6871362  0.5056990
        0.4    0.7158433225  0.6931781  0.6751992  0.5805263
        0.5    0.0003820240  0.5225069  0.7341692  0.4320918
        0.5    0.0008825249  0.5223854  0.7342729  0.4320245
        0.5    0.0020387471  0.5220775  0.7344933  0.4316556
        0.5    0.0047097703  0.5215634  0.7348921  0.4308651
        0.5    0.0108801806  0.5216570  0.7346623  0.4308962
        0.5    0.0251346291  0.5221901  0.7344491  0.4332407
        0.5    0.0580642545  0.5265059  0.7328969  0.4404169
        0.5    0.1341359623  0.5550801  0.7117445  0.4676511
        0.5    0.3098714780  0.6046765  0.6891083  0.5112450
        0.5    0.7158433225  0.7216325  0.6842853  0.6065182
        0.6    0.0003820240  0.5228011  0.7339215  0.4324977
        0.6    0.0008825249  0.5224106  0.7342617  0.4320815
        0.6    0.0020387471  0.5219702  0.7345905  0.4315486
        0.6    0.0047097703  0.5214816  0.7349591  0.4307283
        0.6    0.0108801806  0.5216594  0.7346453  0.4312456
        0.6    0.0251346291  0.5218164  0.7348499  0.4336438
        0.6    0.0580642545  0.5284605  0.7312139  0.4427067
        0.6    0.1341359623  0.5630044  0.7037609  0.4744761
        0.6    0.3098714780  0.6096292  0.6937591  0.5152154
        0.6    0.7158433225  0.7546190  0.6918228  0.6344184
        0.7    0.0003820240  0.5225943  0.7340715  0.4322544
        0.7    0.0008825249  0.5223961  0.7342794  0.4320480
        0.7    0.0020387471  0.5218804  0.7346772  0.4314297
        0.7    0.0047097703  0.5214031  0.7350001  0.4305883
        0.7    0.0108801806  0.5217405  0.7345737  0.4316623
        0.7    0.0251346291  0.5215655  0.7351777  0.4339676
        0.7    0.0580642545  0.5313261  0.7284626  0.4457845
        0.7    0.1341359623  0.5681455  0.6986935  0.4790050
        0.7    0.3098714780  0.6162603  0.6955693  0.5204536
        0.7    0.7158433225  0.7881084  0.6953102  0.6610099
        0.8    0.0003820240  0.5226445  0.7340408  0.4322954
        0.8    0.0008825249  0.5223703  0.7343052  0.4320652
        0.8    0.0020387471  0.5218030  0.7347453  0.4313461
        0.8    0.0047097703  0.5213865  0.7350036  0.4305462
        0.8    0.0108801806  0.5218877  0.7344386  0.4322037
        0.8    0.0251346291  0.5216047  0.7352628  0.4344838
        0.8    0.0580642545  0.5339401  0.7258578  0.4486395
        0.8    0.1341359623  0.5708584  0.6968780  0.4813561
        0.8    0.3098714780  0.6242583  0.6950890  0.5261636
        0.8    0.7158433225  0.8211556  0.6957120  0.6881450
        0.9    0.0003820240  0.5224549  0.7342159  0.4321450
        0.9    0.0008825249  0.5223022  0.7343360  0.4319805
        0.9    0.0020387471  0.5217390  0.7347943  0.4312610
        0.9    0.0047097703  0.5213608  0.7350144  0.4305236
        0.9    0.0108801806  0.5220283  0.7342824  0.4327218
        0.9    0.0251346291  0.5218485  0.7351628  0.4351333
        0.9    0.0580642545  0.5360819  0.7237029  0.4507902
        0.9    0.1341359623  0.5731573  0.6955409  0.4829651
        0.9    0.3098714780  0.6313123  0.6950041  0.5307282
        0.9    0.7158433225  0.8599629  0.6957120  0.7207924
        1.0    0.0003820240  0.5224778  0.7342351  0.4321632
        1.0    0.0008825249  0.5222664  0.7343740  0.4319393
        1.0    0.0020387471  0.5216845  0.7348274  0.4311726
        1.0    0.0047097703  0.5212938  0.7350651  0.4304975
        1.0    0.0108801806  0.5221174  0.7341735  0.4330826
        1.0    0.0251346291  0.5224455  0.7346374  0.4360844
        1.0    0.0580642545  0.5377594  0.7220514  0.4521799
        1.0    0.1341359623  0.5747790  0.6947389  0.4838089
        1.0    0.3098714780  0.6378602  0.6957120  0.5356386
        1.0    0.7158433225  0.9060164  0.6957120  0.7594059
      
      RMSE was used to select the optimal model using the smallest value.
      The final values used for the model were alpha = 1 and lambda = 0.00470977.
    • According to the output, what criterion was used to pick the best submodel? What is the value of this criterion? Plot the object mod_EN.

      # Type your code and comments inside the code chunk
      # Viewing results from mod_EN and plotting mod_EN
      plot(mod_EN)

      RMSE was used to pick the best submodel; the final values are alpha = 1 and lambda = 0.00470977, giving an RMSE of 0.5212938.

    • Compute the RMSE for mod_EN using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_EN <- RMSE(predict(mod_EN, testingTrans), testingTrans$brozek_C)
      RMSE_EN
      [1] 0.6246849

  2. Fit a regression tree.

    • Use the train() function with method = "rpart", tuneLength = 10 along with the myControl as the trControl to fit a regression tree named mod_TR. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit Regression Tree
      set.seed(42)
      mod_TR <- train(brozek_C ~ . -age -abdomen_cm, 
                      data = trainingTrans, 
                      method = "rpart", 
                      tuneLength = 10, 
                      trControl = myControl)
    • Print mod_TR to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_TR
      print(mod_TR)
      CART 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        cp           RMSE       Rsquared   MAE      
        0.007703033  0.6636885  0.5851521  0.5433992
        0.008433471  0.6625036  0.5849659  0.5434520
        0.010116937  0.6542281  0.5939651  0.5376865
        0.011548602  0.6543465  0.5930476  0.5386454
        0.015615982  0.6604265  0.5864218  0.5462687
        0.020312130  0.6588497  0.5808712  0.5451845
        0.025576073  0.6477331  0.5946999  0.5348745
        0.036213572  0.6592349  0.5789596  0.5380807
        0.097234710  0.7291152  0.4830263  0.5844113
        0.525027794  0.8741036  0.4114001  0.7219792
      
      RMSE was used to select the optimal model using the smallest value.
      The final value used for the model was cp = 0.02557607.
    • According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?

      # Type your code and comments inside the code chunk
      # Viewing results from mod_TR 

      RMSE was used to determine the best model; the chosen complexity parameter is cp = 0.02557607, with an RMSE of 0.6477331.

    • Use the rpart() function from the rpart package written by Therneau and Atkinson (2018) to build the regression tree using the complexity parameter (cp) value from mod_TR above. Name this tree mod_TR3.

      # Type your code and comments inside the code chunk
      # Building regression tree using rpart
      library(rpart)
      mod_TR3 <- rpart(brozek_C ~ . -age -abdomen_cm, 
                       data = trainingTrans,
                       control = rpart.control(cp = mod_TR$bestTune$cp, xval = 50))
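      Optionally, the cross-validated error of the fitted tree across complexity values can be inspected with printcp() from rpart:

      ```r
      # Complexity-parameter table for the fitted tree
      printcp(mod_TR3)
      ```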
    • Use the plot() function from the partykit package written by Hothorn and Zeileis (2019) to graph mod_TR3.

      # Type your code and comments inside the code chunk
      # Plotting mod_TR3 with partykit
      library(partykit)
      plot(as.party(mod_TR3))

    • Use the rpart.plot() function from the rpart.plot package written by Milborrow (2018) to graph mod_TR3.

      # Type your code and comments inside the code chunk
      # Plotting mod_TR3 with rpart.plot
      library(rpart.plot)
      rpart.plot(mod_TR3)

    • What predictors are used in the graph of mod_TR3?

      The predictor used in the graph of mod_TR3 is the variable abdomen_wrist.

    • Explain the tree.

      The tree splits only on abdomen_wrist. In the rpart.plot graph, each node displays the predicted (transformed) brozek_C value and the percentage of training observations that fall in that node. For example, observations with abdomen_wrist below −0.3 and below −0.9 land in a node holding about 20% of the training data, with a low predicted body fat; observations with abdomen_wrist above −0.3, above 0.77, and above 1.8 land in a node holding about 3% of the data, with the highest predicted body fat.

    • According to the tree, the abdomen_wrist measurements can be negative. Is this possible? If so, explain the reason for the negative values.

      The variable abdomen_wrist equals abdomen_cm − wrist_cm, which is always positive in the raw measurements. The negative values in the tree appear because the predictors in trainingTrans were centered and scaled, so any value below the training-set mean becomes negative.
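      To translate a standardized split point back to centimeters, the centering and scaling can be inverted; this ignores any Box-Cox step preProcess may have applied to abdomen_wrist, so the values are only approximate:

      ```r
      # Approximate split points on the original centimeter scale
      m <- mean(training$abdomen_wrist)
      s <- sd(training$abdomen_wrist)
      m + c(-0.9, -0.3, 0.77, 1.8) * s
      ```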

    • Compute the RMSE for mod_TR3 using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_TR <- RMSE(predict(mod_TR3, testingTrans), testingTrans$brozek_C)
      RMSE_TR
      [1] 0.7093008

  3. Fit a Random Forest model.

    • Use the train() function with method = "ranger", tuneLength = 10 along with the myControl as the trControl to fit a regression tree named mod_RF. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit Random Forest model
      set.seed(42)
      mod_RF <- train(brozek_C ~ . -age -abdomen_cm, 
              data = trainingTrans, 
              method = "ranger", 
              tuneLength = 10, 
              trControl = myControl)
    • Print mod_RF to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_RF
      print(mod_RF)
      Random Forest 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        mtry  splitrule   RMSE       Rsquared   MAE      
         2    variance    0.6033529  0.6553213  0.4990980
         2    extratrees  0.6109769  0.6535320  0.5042730
         3    variance    0.5921607  0.6657163  0.4922751
         3    extratrees  0.5968306  0.6666102  0.4925443
         4    variance    0.5845042  0.6728471  0.4863736
         4    extratrees  0.5882696  0.6749139  0.4869029
         6    variance    0.5774562  0.6787746  0.4826502
         6    extratrees  0.5781233  0.6835980  0.4805376
         7    variance    0.5759719  0.6800426  0.4812854
         7    extratrees  0.5741989  0.6874072  0.4772021
         9    variance    0.5732636  0.6821994  0.4774872
         9    extratrees  0.5693255  0.6917845  0.4752408
        10    variance    0.5719643  0.6832542  0.4774101
        10    extratrees  0.5683003  0.6925570  0.4733990
        12    variance    0.5723502  0.6822282  0.4790659
        12    extratrees  0.5641502  0.6957053  0.4717005
        13    variance    0.5710778  0.6833039  0.4770955
        13    extratrees  0.5632266  0.6970135  0.4702682
        15    variance    0.5707742  0.6833333  0.4768475
        15    extratrees  0.5596436  0.6994116  0.4676028
      
      Tuning parameter 'min.node.size' was held constant at a value of 5
      RMSE was used to select the optimal model using the smallest value.
      The final values used for the model were mtry = 15, splitrule =
       extratrees and min.node.size = 5.
    • According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?

      # Type your code and comments inside the code chunk
      # Viewing results from mod_RF
      mod_RF$bestTune
         mtry  splitrule min.node.size
      20   15 extratrees             5

      Using RMSE as the criterion, the best model has mtry = 15, splitrule = "extratrees", and min.node.size = 5, with an RMSE of 0.5596436.

    • Use the RMSE() function along with the predict() function to find the root mean square error for the testing data.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_RF <- RMSE(predict(mod_RF, testingTrans), testingTrans$brozek_C)
      RMSE_RF
      [1] 0.5873907

  1. Among the models created from Problem 6 - Problem 10 (mod_FS, mod_BE, mod_EN, mod_TR, and mod_RF), which do you think is best for predicting body fat and why?

    ```r
    # Type your code and comments inside the code chunk
    # Collecting the test RMSE values from each model
    RMSE_all <- c(RMSE_FS, RMSE_BE, RMSE_EN, RMSE_TR, RMSE_RF)
    RMSE_all
    ```
    
    ```
    [1] 0.6117522 0.6117522 0.6246849 0.7093008 0.5873907
    ```

    Among the models created in Problem 6 through Problem 10, the best model for predicting body fat appears to be the random forest, mod_RF, because its test RMSE is the lowest of the five models.
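    A resamples list can also be built to compare the caret models across their common cross-validation folds (mod_TR3, fit with rpart() directly, is excluded):

    ```r
    # Compare resampling distributions across the caret models
    mods <- resamples(list(FS = mod_FS, BE = mod_BE, EN = mod_EN,
                           TR = mod_TR, RF = mod_RF))
    summary(mods)
    ```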


  1. Many statistical algorithms work better on transformed variables; however, the user, whether a nurse, physical therapist, or physician, should be able to use your proposed model without resorting to a spreadsheet or calculator. Consequently, no transformations are applied in the models you fit in this question. Repeat Problem 6 through Problem 10 using the untransformed data in training and testing you created in Problem 3. Make sure to give new names to the models that use the untransformed data.

    • Use the corrplot() function from the corrplot package written by Wei and Simko (2017) to identify predictors that may be linearly related in training.

      # Type your code and comments inside the code chunk
      # Identifying linearly related predictors Problem 6
      cor_mat <- cor(training)
      corrplot(cor_mat, method = "number")
      cm <- cor(x = training$abdomen_cm, y = training$brozek_C)
      wrist <- cor(x = training$abdomen_wrist, y = training$brozek_C)
    • Use the train() function with method = "leapForward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a forward selection model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_FS2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit model with forward stepwise selection
      set.seed(42)
      mod_FS2 <- train(brozek_C ~ . -age -abdomen_cm,
             data = training,
             method = "leapForward",
             tuneLength = 10,
             trControl = myControl)
    • Print mod_FS2 to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_FS2
      print(mod_FS2)
      Linear Regression with Forward Selection 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        nvmax  RMSE      Rsquared   MAE     
         2     3.975510  0.7172364  3.328440
         3     4.013375  0.7136037  3.350542
         4     4.013338  0.7128407  3.356147
         5     4.048652  0.7087094  3.369878
         6     4.049642  0.7092520  3.365529
         7     4.017277  0.7142455  3.346466
         8     4.009138  0.7156374  3.330654
         9     3.976430  0.7204939  3.292605
        10     3.966893  0.7225691  3.290474
        11     3.965347  0.7223120  3.284813
      
      RMSE was used to select the optimal model using the smallest value.
      The final value used for the model was nvmax = 11.
    • Using the output in your console, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?

      # Type your code and comments inside the code chunk
      # Isolating results from mod_FS2
      mod_FS2$bestTune
         nvmax
      10    11

      Using RMSE with nvmax = 11, the best model has an RMSE of 3.965347, so eleven predictor variables are selected. Note that with the untransformed response, RMSE is now in percent body fat units, so it is not directly comparable to the RMSE values from the transformed models above.

    • Use the train() function with method = "leapBackward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a backward elimination model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_BE2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit model with backwards stepwise selection Problem 7 
      # with untransformed data
      set.seed(42)
      mod_BE2 <- train(brozek_C ~ . -age -abdomen_cm, 
                      data = training, 
                      method = "leapBackward", 
                      tuneLength = 10, 
                      trControl = myControl)
    • Print mod_BE2 to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_BE2
      print(mod_BE2)
      Linear Regression with Backwards Selection 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        nvmax  RMSE      Rsquared   MAE     
         2     4.002785  0.7160759  3.322765
         3     3.940370  0.7245924  3.277038
         4     4.024413  0.7137547  3.356219
         5     3.975481  0.7210802  3.313074
         6     3.997598  0.7180677  3.318327
         7     4.024611  0.7153761  3.338106
         8     4.017507  0.7168306  3.334540
         9     4.008288  0.7181004  3.325982
        10     3.996402  0.7191567  3.315898
        11     3.985836  0.7195863  3.304527
      
      RMSE was used to select the optimal model using the smallest value.
      The final value used for the model was nvmax = 3.
    • According to the output, what criterion has been used to pick the best submodel? What is the value of the criterion that has been used? How many predictor variables are selected?

      # Type your code and comments inside the code chunk
      # Viewing results from mod_BE2
      mod_BE2$bestTune
        nvmax
      2     3

      Using RMSE with nvmax = 3, the best model has an RMSE of 3.940370, so three predictor variables are selected.

    • Compute the RMSE for mod_BE2 using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_BE2 <- RMSE(predict(mod_BE2, testing), testing$brozek_C)
      RMSE_BE2
      [1] 5.045508
    • Use the train function with method = "glmnet" and tuneLength = 10 to fit a constrained linear regression model named mod_EN2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit constrained model Problem 8 
      # with untransformed data
      set.seed(42)
      mod_EN2 <- train(brozek_C ~ . -age -abdomen_cm, 
                      data = training, 
                      method = "glmnet", 
                      tuneLength = 10, 
                      trControl = myControl)
    • Print mod_EN2 to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_EN2
      print(mod_EN2)
      glmnet 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        alpha  lambda       RMSE      Rsquared   MAE     
        0.1    0.002826336  3.898212  0.7306816  3.245262
        0.1    0.006529204  3.897933  0.7307220  3.245058
        0.1    0.015083308  3.895545  0.7311403  3.241223
        0.1    0.034844398  3.891779  0.7317991  3.234050
        0.1    0.080495081  3.884222  0.7329876  3.221861
        0.1    0.185954082  3.889032  0.7324875  3.217446
        0.1    0.429578060  3.933562  0.7273019  3.273984
        0.1    0.992381059  4.021400  0.7185261  3.361966
        0.1    2.292529014  4.258896  0.6919319  3.578695
        0.1    5.296039497  4.668823  0.6471909  3.903705
        0.2    0.002826336  3.903327  0.7301773  3.248850
        0.2    0.006529204  3.902239  0.7303746  3.247602
        0.2    0.015083308  3.898259  0.7309218  3.242272
        0.2    0.034844398  3.890639  0.7320320  3.232161
        0.2    0.080495081  3.881648  0.7333300  3.218666
        0.2    0.185954082  3.891880  0.7321305  3.224696
        0.2    0.429578060  3.928085  0.7285089  3.282415
        0.2    0.992381059  4.040690  0.7178424  3.393632
        0.2    2.292529014  4.349818  0.6849613  3.654502
        0.2    5.296039497  4.813257  0.6520217  4.024312
        0.3    0.002826336  3.906153  0.7298036  3.249005
        0.3    0.006529204  3.904771  0.7300055  3.247757
        0.3    0.015083308  3.898637  0.7308179  3.241826
        0.3    0.034844398  3.887696  0.7324621  3.229256
        0.3    0.080495081  3.879801  0.7335202  3.216224
        0.3    0.185954082  3.896331  0.7315477  3.235838
        0.3    0.429578060  3.928801  0.7289281  3.293221
        0.3    0.992381059  4.066518  0.7159738  3.422237
        0.3    2.292529014  4.400627  0.6838901  3.704633
        0.3    5.296039497  4.983582  0.6586167  4.171876
        0.4    0.002826336  3.903819  0.7301210  3.248358
        0.4    0.006529204  3.903426  0.7301710  3.247816
        0.4    0.015083308  3.898168  0.7309531  3.241826
        0.4    0.034844398  3.885125  0.7328299  3.226894
        0.4    0.080495081  3.879460  0.7335013  3.215697
        0.4    0.185954082  3.899541  0.7311169  3.245998
        0.4    0.429578060  3.934357  0.7286345  3.306616
        0.4    0.992381059  4.092884  0.7138247  3.444701
        0.4    2.292529014  4.444713  0.6859326  3.747502
        0.4    5.296039497  5.149917  0.6736927  4.319777
        0.5    0.002826336  3.903505  0.7302286  3.247733
        0.5    0.006529204  3.902470  0.7303571  3.246765
        0.5    0.015083308  3.897127  0.7311317  3.240569
        0.5    0.034844398  3.883187  0.7330516  3.224895
        0.5    0.080495081  3.880840  0.7332664  3.217132
        0.5    0.185954082  3.903901  0.7305478  3.254614
        0.5    0.429578060  3.945139  0.7274254  3.321369
        0.5    0.992381059  4.139961  0.7080266  3.484672
        0.5    2.292529014  4.492848  0.6880800  3.788181
        0.5    5.296039497  5.359581  0.6829231  4.507267
        0.6    0.002826336  3.903276  0.7302020  3.248133
        0.6    0.006529204  3.900919  0.7304966  3.246190
        0.6    0.015083308  3.896081  0.7312840  3.239674
        0.6    0.034844398  3.882047  0.7331546  3.223683
        0.6    0.080495081  3.883016  0.7329796  3.219828
        0.6    0.185954082  3.907795  0.7300138  3.262730
        0.6    0.429578060  3.956338  0.7260768  3.333083
        0.6    0.992381059  4.193919  0.7005900  3.530124
        0.6    2.292529014  4.529196  0.6924889  3.821134
        0.6    5.296039497  5.601769  0.6908634  4.710635
        0.7    0.002826336  3.904127  0.7299778  3.249239
        0.7    0.006529204  3.901454  0.7303555  3.247018
        0.7    0.015083308  3.892907  0.7317257  3.236682
        0.7    0.034844398  3.880630  0.7333266  3.222263
        0.7    0.080495081  3.885852  0.7325881  3.224287
        0.7    0.185954082  3.910763  0.7295979  3.269390
        0.7    0.429578060  3.969058  0.7243910  3.344746
        0.7    0.992381059  4.222578  0.6970383  3.553660
        0.7    2.292529014  4.579203  0.6938678  3.860845
        0.7    5.296039497  5.843028  0.6940269  4.903381
        0.8    0.002826336  3.906831  0.7294851  3.250907
        0.8    0.006529204  3.903079  0.7301270  3.248279
        0.8    0.015083308  3.890867  0.7320087  3.234609
        0.8    0.034844398  3.879567  0.7334441  3.220952
        0.8    0.080495081  3.888609  0.7321969  3.228650
        0.8    0.185954082  3.912692  0.7292777  3.275092
        0.8    0.429578060  3.982618  0.7224672  3.356549
        0.8    0.992381059  4.242224  0.6953071  3.568179
        0.8    2.292529014  4.637208  0.6933527  3.903227
        0.8    5.296039497  6.086977  0.6942750  5.103765
        0.9    0.002826336  3.907243  0.7295920  3.251813
        0.9    0.006529204  3.904005  0.7300972  3.248767
        0.9    0.015083308  3.888892  0.7322782  3.232773
        0.9    0.034844398  3.879209  0.7334749  3.220375
        0.9    0.080495081  3.891159  0.7318504  3.232807
        0.9    0.185954082  3.914933  0.7289410  3.280292
        0.9    0.429578060  3.994135  0.7208662  3.364936
        0.9    0.992381059  4.258793  0.6939615  3.578897
        0.9    2.292529014  4.684735  0.6939008  3.936921
        0.9    5.296039497  6.373696  0.6942750  5.345038
        1.0    0.002826336  3.907552  0.7295530  3.252845
        1.0    0.006529204  3.905273  0.7299595  3.249799
        1.0    0.015083308  3.887443  0.7324695  3.231457
        1.0    0.034844398  3.879537  0.7334220  3.220381
        1.0    0.080495081  3.892855  0.7316197  3.236196
        1.0    0.185954082  3.916939  0.7286731  3.284792
        1.0    0.429578060  4.005379  0.7194001  3.373151
        1.0    0.992381059  4.268706  0.6933990  3.583624
        1.0    2.292529014  4.734330  0.6942750  3.979685
        1.0    5.296039497  6.713941  0.6942750  5.629564
      
      RMSE was used to select the optimal model using the smallest value.
      The final values used for the model were alpha = 0.9 and lambda
       = 0.0348444.
    • According to the output, what criterion was used to pick the best submodel? What is the value of this criterion? Plot the object mod_EN2.

      # Type your code and comments inside the code chunk
      # Viewing results from mod_EN2
      mod_EN2$bestTune
         alpha    lambda
      84   0.9 0.0348444
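
      The question also asks for a plot of mod_EN2; the plot() method for a caret train object displays the resampled RMSE across the tuning grid, so a single call covers it:

      # Plotting mod_EN2: resampled RMSE across the alpha/lambda grid
      plot(mod_EN2)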

      RMSE was used to pick the best submodel; the smallest cross-validated RMSE is 3.8792, obtained with alpha = 0.9 and lambda = 0.0348.

    • Compute the RMSE for mod_EN2 using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_EN2 <- RMSE(predict(mod_EN2, testing), testing$brozek_C)
      RMSE_EN2
      [1] 5.257586
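
      RMSE() from caret is simply the square root of the mean squared prediction error, so the value above can be verified by hand (pred_EN2 below is a hypothetical intermediate name):

      # Manual check of RMSE(): square root of the mean squared error
      pred_EN2 <- predict(mod_EN2, testing)
      sqrt(mean((pred_EN2 - testing$brozek_C)^2))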
    • Use the train() function with method = "rpart", tuneLength = 10 along with the myControl as the trControl to fit a regression tree named mod_TR2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit a regression tree for Problem 9
      # with untransformed data
      set.seed(42)
      mod_TR2 <- train(brozek_C ~ . -age -abdomen_cm, 
              data = training, 
              method = "rpart", 
              tuneLength = 10, 
              trControl = myControl)
    • Print mod_TR2 to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_TR2
      print(mod_TR2)
      CART 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        cp           RMSE      Rsquared   MAE     
        0.007703033  4.916433  0.5847639  4.025387
        0.008433471  4.907655  0.5845893  4.025778
        0.010116937  4.846352  0.5935986  3.983069
        0.011548602  4.847237  0.5928790  3.990194
        0.015615982  4.892275  0.5862508  4.046665
        0.020312130  4.881305  0.5806739  4.040009
        0.025576073  4.798975  0.5945769  3.963676
        0.036213572  4.883539  0.5790766  3.986136
        0.097234710  5.401180  0.4831269  4.329337
        0.525027794  6.475054  0.4114001  5.348170
      
      RMSE was used to select the optimal model using the smallest value.
      The final value used for the model was cp = 0.02557607.
    • According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?

      # Type your code and comments inside the code chunk
      # Viewing results from mod_TR2
      mod_TR2$bestTune
                cp
      7 0.02557607

      The best model was selected using RMSE; the smallest RMSE is 4.7990, obtained with a complexity parameter (cp) of 0.0256.

    • Use the rpart() function from the rpart package written by Therneau and Atkinson (2018) to build the regression tree using the complexity parameter (cp) value from mod_TR2 above. Name this tree mod_TR4.

      # Type your code and comments inside the code chunk
      # Building regression tree using rpart
      mod_TR4 <- rpart(brozek_C ~ . -age -abdomen_cm, 
               data = training,
               control = rpart.control(cp = mod_TR2$bestTune$cp, xval = 50))
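
      Because xval = 50 asks rpart() to run 50 internal cross-validations, the resulting complexity-parameter table is worth inspecting before plotting; printcp() from the rpart package prints it:

      # Displaying the cp table with cross-validated error (xerror)
      printcp(mod_TR4)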
    • Use the rpart.plot() function from the rpart.plot package written by Milborrow (2018) to graph mod_TR4.

      # Type your code and comments inside the code chunk
      # Plotting mod_TR4 with rpart.plot
      rpart.plot(mod_TR4)

      Reading the tree, cases with abdomen_wrist greater than 71 but less than 81 fall into a single node containing 37% of the training observations; the number printed above that percentage is the mean brozek_C the tree predicts for everyone in the node. (In rpart.plot output the percentage is node coverage, not the chance of a correct prediction.)
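
      As a text check of this reading, the rpart.rules() function from the rpart.plot package prints each leaf's predicted value together with the splits that define it (cover = TRUE adds the percentage of training cases covered by each leaf):

      # Printing the tree's decision rules with leaf coverage
      rpart.rules(mod_TR4, cover = TRUE)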

    • Compute the RMSE for mod_TR4 using the testing data set.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_TR4 <- RMSE(predict(mod_TR4, testing), testing$brozek_C)
      RMSE_TR4
      [1] 5.254253
    • Use the train() function with method = "ranger", tuneLength = 10 along with the myControl as the trControl to fit a regression tree named mod_RF2. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

      # Type your code and comments inside the code chunk
      # Fit a random forest for Problem 10
      # with untransformed data
      set.seed(42)
      mod_RF2 <- train(brozek_C ~ . -age -abdomen_cm, 
              data = training, 
              method = "ranger", 
              tuneLength = 10, 
              trControl = myControl)
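
      As an aside, tuneLength = 10 lets caret choose the candidate values of mtry, splitrule, and min.node.size; an explicit grid gives finer control. The sketch below is illustrative only (rf_grid and mod_RF2_grid are hypothetical names, and the mtry values are assumptions):

      # Hedged alternative: supply an explicit tuning grid instead of tuneLength
      rf_grid <- expand.grid(mtry = c(4, 8, 12, 15),
                             splitrule = c("variance", "extratrees"),
                             min.node.size = 5)
      set.seed(42)
      mod_RF2_grid <- train(brozek_C ~ . -age -abdomen_cm,
                            data = training,
                            method = "ranger",
                            tuneGrid = rf_grid,
                            trControl = myControl)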
    • Print mod_RF2 to the R console.

      # Type your code and comments inside the code chunk
      # Printing mod_RF2
      print(mod_RF2)
      Random Forest 
      
      203 samples
       17 predictor
      
      No pre-processing
      Resampling: Cross-Validated (10 fold, repeated 5 times) 
      Summary of sample sizes: 183, 182, 181, 183, 184, 183, ... 
      Resampling results across tuning parameters:
      
        mtry  splitrule   RMSE      Rsquared   MAE     
         2    variance    4.469335  0.6555022  3.693232
         2    extratrees  4.527554  0.6534191  3.725801
         3    variance    4.386340  0.6659372  3.644807
         3    extratrees  4.428709  0.6652206  3.657601
         4    variance    4.331055  0.6726188  3.603693
         4    extratrees  4.360713  0.6738456  3.610469
         6    variance    4.278248  0.6788437  3.577347
         6    extratrees  4.271360  0.6848878  3.542221
         7    variance    4.266490  0.6801720  3.567629
         7    extratrees  4.255429  0.6870425  3.538343
         9    variance    4.248479  0.6817219  3.538063
         9    extratrees  4.211828  0.6922616  3.510987
        10    variance    4.243011  0.6824179  3.541788
        10    extratrees  4.206633  0.6931284  3.505200
        12    variance    4.245644  0.6815463  3.553926
        12    extratrees  4.176676  0.6964050  3.490920
        13    variance    4.235464  0.6828409  3.540009
        13    extratrees  4.160651  0.6978126  3.473362
        15    variance    4.233937  0.6826609  3.538610
        15    extratrees  4.146548  0.6997243  3.469731
      
      Tuning parameter 'min.node.size' was held constant at a value of 5
      RMSE was used to select the optimal model using the smallest value.
      The final values used for the model were mtry = 15, splitrule =
       extratrees and min.node.size = 5.
    • According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?

      # Type your code and comments inside the code chunk
      # Viewing results from mod_RF2
      mod_RF2$bestTune
         mtry  splitrule min.node.size
      20   15 extratrees             5

      The criterion used was RMSE, with a smallest value of 4.1465 (mtry = 15, splitrule = extratrees, and min.node.size = 5).
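
      Note that variable importance is not available from mod_RF2 as fit, because ranger computes importance measures only on request. A hedged sketch of a refit at the chosen tuning values (mod_RF2_imp is a hypothetical name; importance = "permutation" is passed through train() to ranger):

      # Refitting at the selected tuning values with permutation importance,
      # then ranking the predictors
      set.seed(42)
      mod_RF2_imp <- train(brozek_C ~ . -age -abdomen_cm,
                           data = training,
                           method = "ranger",
                           tuneGrid = mod_RF2$bestTune,
                           trControl = myControl,
                           importance = "permutation")
      varImp(mod_RF2_imp)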

    • Use the function RMSE() along with the predict() function to find the root mean squared error (RMSE) for the testing data.

      # Type your code and comments inside the code chunk
      # Computing RMSE on the testing set
      RMSE_RF2 <- RMSE(predict(mod_RF2, testing), testing$brozek_C)
      RMSE_RF2
      [1] 4.339785
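
      Collecting the three test-set RMSE values computed above into one small data frame (test_rmse is a hypothetical name) makes the comparison explicit:

      # Side-by-side comparison of the test-set RMSE values
      test_rmse <- data.frame(model = c("mod_EN2", "mod_TR4", "mod_RF2"),
                              RMSE = c(RMSE_EN2, RMSE_TR4, RMSE_RF2))
      test_rmse[order(test_rmse$RMSE), ]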
    • Which model does the best job of predicting body fat?

      # Type your code and comments inside the code chunk
      # Creating a resamples list of the caret-trained models.
      # mod_TR4 is excluded because it was fit directly with rpart(), so it
      # carries no caret resampling results for resamples() to collect.
      model_list2 <- list(item1 = mod_FS2, item2 = mod_BE2, item3 = mod_EN2, item4 = mod_TR2, item7 = mod_RF2)
      
      ans2 <- resamples(model_list2)
      
      summary(ans2)
      
      Call:
      summary.resamples(object = ans2)
      
      Models: item1, item2, item3, item4, item7 
      Number of resamples: 50 
      
      MAE 
                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
      item1 2.194046 3.021355 3.218821 3.284813 3.675648 4.330898    0
      item2 2.397646 3.037138 3.233365 3.277038 3.554873 4.475806    0
      item3 2.311599 2.953437 3.211983 3.220375 3.503270 4.264038    0
      item4 2.963476 3.495434 3.918600 3.963676 4.369686 5.332065    0
      item7 2.719000 3.148048 3.316040 3.469731 3.843903 4.287787    0
      
      RMSE 
                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
      item1 2.728359 3.666489 3.928735 3.965347 4.304002 5.284396    0
      item2 2.825743 3.728220 3.933744 3.940370 4.278550 5.035836    0
      item3 2.772695 3.643004 3.868425 3.879209 4.141040 4.978459    0
      item4 3.460347 4.340325 4.726587 4.798975 5.187490 6.432799    0
      item7 3.307274 3.738256 4.034858 4.146548 4.578877 5.323989    0
      
      Rsquared 
                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
      item1 0.5297948 0.6551675 0.7351440 0.7223120 0.7755229 0.9152867    0
      item2 0.5602622 0.6796893 0.7265992 0.7245924 0.7725706 0.9016342    0
      item3 0.5478564 0.6854955 0.7395248 0.7334749 0.7816228 0.9013204    0
      item4 0.3180474 0.5361211 0.6107806 0.5945769 0.6736918 0.7866000    0
      item7 0.5014774 0.6446490 0.6992479 0.6997243 0.7556276 0.8463160    0
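
      The same comparison is often easier to digest graphically; caret provides lattice plotting methods for resamples objects:

      # Boxplots of the 50 resampled RMSE values for each model
      bwplot(ans2, metric = "RMSE")
      # Mean RMSE with confidence intervals
      dotplot(ans2, metric = "RMSE")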

      Judging by the resampling summary, mod_EN2 (item3) has the smallest mean RMSE (3.8792) across the 50 resamples; on the held-out testing data, mod_RF2 has the smallest RMSE (4.3398, versus 5.2576 for mod_EN2 and 5.2543 for mod_TR4). The random forest therefore does the best job of predicting body fat on new data.

    • Which model is the most practical model for someone who needs to rapidly assess a patient’s body fat?

      A single regression tree is likely the most practical model for rapidly assessing a patient’s body fat. The fitted tree needs only a few measurements and a handful of comparisons, so it can be applied at the bedside and explained to someone without statistical training. The random forest is more accurate, but it averages many hundreds of trees, making it slower to evaluate and impossible to apply by hand.


References

Dowle, Matt, and Arun Srinivasan. 2019. Data.table: Extension of 'Data.frame'. https://CRAN.R-project.org/package=data.table.

Hothorn, Torsten, and Achim Zeileis. 2019. Partykit: A Toolkit for Recursive Partytioning. https://CRAN.R-project.org/package=partykit.

Johnson, Roger W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics Education 4 (1). doi:10.1080/10691898.1996.11910505.

Milborrow, Stephen. 2018. Rpart.plot: Plot 'Rpart' Models: An Enhanced Version of 'Plot.rpart'. https://CRAN.R-project.org/package=rpart.plot.

Therneau, Terry, and Beth Atkinson. 2018. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.

Wei, Taiyun, and Viliam Simko. 2017. Corrplot: Visualization of a Correlation Matrix. https://CRAN.R-project.org/package=corrplot.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.