Read in Data

# Read the UCI seeds dataset (whitespace-delimited, no header row).
# NOTE(review): the original column name "perimemter" was a typo; corrected
# to "perimeter". No later code references this column by name (models use
# `kernname ~ . - kernnumber`), so the fix is backward-compatible.
seeds <- read.table(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt",
  col.names = c("area", "perimeter", "compactness", "lengthkernel",
                "widthkernel", "asymcoef", "lengthgroove", "kernnumber")
)

# Create a labelled class variable for each kernel variety
library(tidyverse)

# Map the numeric class code (1/2/3) to a labelled factor of variety names.
# Folding factor() into the mutate() gives the same result as the original
# two-step assignment (levels default to alphabetical order either way).
seeds <- seeds %>%
  mutate(
    kernname = factor(
      case_when(
        kernnumber == 1 ~ "Kama",
        kernnumber == 2 ~ "Rosa",
        kernnumber == 3 ~ "Canadian"
      )
    )
  )
summary(seeds)
      area         perimemter     compactness      lengthkernel  
 Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899  
 1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262  
 Median :14.36   Median :14.32   Median :0.8734   Median :5.524  
 Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629  
 3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980  
 Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675  
  widthkernel       asymcoef       lengthgroove     kernnumber
 Min.   :2.630   Min.   :0.7651   Min.   :4.519   Min.   :1   
 1st Qu.:2.944   1st Qu.:2.5615   1st Qu.:5.045   1st Qu.:1   
 Median :3.237   Median :3.5990   Median :5.223   Median :2   
 Mean   :3.259   Mean   :3.7002   Mean   :5.408   Mean   :2   
 3rd Qu.:3.562   3rd Qu.:4.7687   3rd Qu.:5.877   3rd Qu.:3   
 Max.   :4.033   Max.   :8.4560   Max.   :6.550   Max.   :3   
     kernname 
 Canadian:70  
 Kama    :70  
 Rosa    :70  
              
              
              

Partition Data

# caret provides createDataPartition() (and train() below) but was never
# attached -- load it explicitly so this chunk runs on its own.
library(caret)

set.seed(934)
# Stratified 70/30 split on the outcome so class balance is preserved
# in both partitions.
in_train <- createDataPartition(y = seeds$kernname,
                                p = 0.70,
                                list = FALSE)
training <- seeds[in_train, ]
testing  <- seeds[-in_train, ]
dim(training)
[1] 147   9
dim(testing)
[1] 63  9

Train Model to find \(c_p\)

set.seed(21)
# CART tree tuned over a 10-value cp grid, evaluated with 10-fold CV
# repeated 5 times.
cv_ctrl <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5)
tree_model <- train(kernname ~ . - kernnumber,
                    data = training,
                    method = "rpart",
                    tuneLength = 10,
                    trControl = cv_ctrl)
tree_model
CART 

147 samples
  8 predictor
  3 classes: 'Canadian', 'Kama', 'Rosa' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 133, 132, 132, 132, 133, 132, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa     
  0.00000000  0.8884982  0.83268059
  0.05328798  0.8952747  0.84277805
  0.10657596  0.8994652  0.84901972
  0.15986395  0.8994652  0.84901972
  0.21315193  0.8994652  0.84901972
  0.26643991  0.8994652  0.84901972
  0.31972789  0.8994652  0.84901972
  0.37301587  0.8994652  0.84901972
  0.42630385  0.6501612  0.47596566
  0.47959184  0.3819927  0.08716224

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.3730159.
tree_model$bestTune
         cp
8 0.3730159
# Refit the final tree directly with rpart at the CV-selected complexity,
# so it can be drawn with rpart.plot.
# Fixes two issues: bestTune is a one-row data frame, so extract the cp
# value with $cp rather than passing the whole frame into rpart.control;
# and the rpart package was never attached, so call it via its namespace.
seeds_tree <- rpart::rpart(kernname ~ . - kernnumber,
                           data = training,
                           method = "class",
                           cp = tree_model$bestTune$cp)
rpart.plot::rpart.plot(seeds_tree, type = 1)

Prediction

# Hard class labels for the hold-out set (type = "raw" gives factor
# predictions rather than class probabilities), cross-tabulated against
# the true varieties.
class_prediction <- predict(tree_model, newdata = testing, type = "raw")
confusionMatrix(data = class_prediction, reference = testing$kernname)
Confusion Matrix and Statistics

          Reference
Prediction Canadian Kama Rosa
  Canadian       21    6    0
  Kama            0   15    0
  Rosa            0    0   21

Overall Statistics
                                          
               Accuracy : 0.9048          
                 95% CI : (0.8041, 0.9642)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8571          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Canadian Class: Kama Class: Rosa
Sensitivity                   1.0000      0.7143      1.0000
Specificity                   0.8571      1.0000      1.0000
Pos Pred Value                0.7778      1.0000      1.0000
Neg Pred Value                1.0000      0.8750      1.0000
Prevalence                    0.3333      0.3333      0.3333
Detection Rate                0.3333      0.2381      0.3333
Detection Prevalence          0.4286      0.2381      0.3333
Balanced Accuracy             0.9286      0.8571      1.0000

Can we do better with Random Forest?

set.seed(21)
# Random forest via ranger, tuned over mtry/splitrule with the same
# repeated 10-fold CV scheme used for the tree model.
rf_ctrl <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5)
rf_model <- train(kernname ~ . - kernnumber,
                  data = training,
                  method = "ranger",
                  tuneLength = 10,
                  trControl = rf_ctrl)
note: only 6 unique complexity parameters in default grid. Truncating the grid to 6 .
rf_model
Random Forest 

147 samples
  8 predictor
  3 classes: 'Canadian', 'Kama', 'Rosa' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 133, 132, 132, 132, 133, 132, ... 
Resampling results across tuning parameters:

  mtry  splitrule   Accuracy   Kappa    
  2     gini        0.9119414  0.8677411
  2     extratrees  0.9172747  0.8757578
  3     gini        0.9172747  0.8757411
  3     extratrees  0.9199414  0.8797578
  4     gini        0.9212747  0.8817411
  4     extratrees  0.9240366  0.8859117
  5     gini        0.9199414  0.8797411
  5     extratrees  0.9280366  0.8919117
  6     gini        0.9226081  0.8837411
  6     extratrees  0.9311136  0.8965751
  7     gini        0.9198462  0.8796207
  7     extratrees  0.9322418  0.8982536

Tuning parameter 'min.node.size' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 7, splitrule =
 extratrees and min.node.size = 1.

Check Test Accuracy

# Random-forest class predictions on the hold-out set, compared against
# the true varieties.
class_prediction <- predict(rf_model, newdata = testing, type = "raw")
confusionMatrix(data = class_prediction, reference = testing$kernname)
Confusion Matrix and Statistics

          Reference
Prediction Canadian Kama Rosa
  Canadian       21    4    0
  Kama            0   17    0
  Rosa            0    0   21

Overall Statistics
                                          
               Accuracy : 0.9365          
                 95% CI : (0.8453, 0.9824)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9048          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Canadian Class: Kama Class: Rosa
Sensitivity                   1.0000      0.8095      1.0000
Specificity                   0.9048      1.0000      1.0000
Pos Pred Value                0.8400      1.0000      1.0000
Neg Pred Value                1.0000      0.9130      1.0000
Prevalence                    0.3333      0.3333      0.3333
Detection Rate                0.3333      0.2698      0.3333
Detection Prevalence          0.3968      0.2698      0.3333
Balanced Accuracy             0.9524      0.9048      1.0000

Fit one more model (Elastic Net)

set.seed(21)
# Multinomial elastic net (glmnet), tuned over alpha/lambda with the same
# repeated 10-fold CV scheme as the other models.
en_ctrl <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5)
en_model <- train(kernname ~ . - kernnumber,
                  data = training,
                  method = "glmnet",
                  tuneLength = 10,
                  trControl = en_ctrl)
en_model
glmnet 

147 samples
  8 predictor
  3 classes: 'Canadian', 'Kama', 'Rosa' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 133, 132, 132, 132, 133, 132, ... 
Resampling results across tuning parameters:

  alpha  lambda        Accuracy   Kappa    
  0.1    0.0001926872  0.9389231  0.9083614
  0.1    0.0004451325  0.9389231  0.9083614
  0.1    0.0010283139  0.9320513  0.8980191
  0.1    0.0023755386  0.9226227  0.8838485
  0.1    0.0054878024  0.9145128  0.8716369
  0.1    0.0126775356  0.9075458  0.8611821
  0.1    0.0292867523  0.9088791  0.8631821
  0.1    0.0676561979  0.9130842  0.8695241
  0.1    0.1562945961  0.9114505  0.8670609
  0.1    0.3610607975  0.9114505  0.8670609
  0.2    0.0001926872  0.9430183  0.9144826
  0.2    0.0004451325  0.9389231  0.9083614
  0.2    0.0010283139  0.9333846  0.9000191
  0.2    0.0023755386  0.9239560  0.8858485
  0.2    0.0054878024  0.9158462  0.8736369
  0.2    0.0126775356  0.9115458  0.8671821
  0.2    0.0292867523  0.9116410  0.8673360
  0.2    0.0676561979  0.9158462  0.8736779
  0.2    0.1562945961  0.9114505  0.8670609
  0.2    0.3610607975  0.9087839  0.8630609
  0.3    0.0001926872  0.9456850  0.9184826
  0.3    0.0004451325  0.9389231  0.9083614
  0.3    0.0010283139  0.9362564  0.9043614
  0.3    0.0023755386  0.9266227  0.8898485
  0.3    0.0054878024  0.9171795  0.8756369
  0.3    0.0126775356  0.9129744  0.8693360
  0.3    0.0292867523  0.9131795  0.8696779
  0.3    0.0676561979  0.9158462  0.8736779
  0.3    0.1562945961  0.9142125  0.8711819
  0.3    0.3610607975  0.9004029  0.8504781
  0.4    0.0001926872  0.9470183  0.9204826
  0.4    0.0004451325  0.9389231  0.9083614
  0.4    0.0010283139  0.9362564  0.9043614
  0.4    0.0023755386  0.9266227  0.8898485
  0.4    0.0054878024  0.9157509  0.8735480
  0.4    0.0126775356  0.9158462  0.8736369
  0.4    0.0292867523  0.9116410  0.8673360
  0.4    0.0676561979  0.9158462  0.8736779
  0.4    0.1562945961  0.9155458  0.8731819
  0.4    0.3610607975  0.8896410  0.8343076
  0.5    0.0001926872  0.9470183  0.9204826
  0.5    0.0004451325  0.9416850  0.9124826
  0.5    0.0010283139  0.9375897  0.9063614
  0.5    0.0023755386  0.9279560  0.8918485
  0.5    0.0054878024  0.9184176  0.8775480
  0.5    0.0126775356  0.9173846  0.8759788
  0.5    0.0292867523  0.9145128  0.8716779
  0.5    0.0676561979  0.9158462  0.8736779
  0.5    0.1562945961  0.9142125  0.8711819
  0.5    0.3610607975  0.8896410  0.8343076
  0.6    0.0001926872  0.9470183  0.9204826
  0.6    0.0004451325  0.9416850  0.9124826
  0.6    0.0010283139  0.9375897  0.9063614
  0.6    0.0023755386  0.9266227  0.8898485
  0.6    0.0054878024  0.9197509  0.8795480
  0.6    0.0126775356  0.9173846  0.8759788
  0.6    0.0292867523  0.9145128  0.8716779
  0.6    0.0676561979  0.9171795  0.8756779
  0.6    0.1562945961  0.9142125  0.8711819
  0.6    0.3610607975  0.8488645  0.7731243
  0.7    0.0001926872  0.9470183  0.9204826
  0.7    0.0004451325  0.9416850  0.9124826
  0.7    0.0010283139  0.9375897  0.9063614
  0.7    0.0023755386  0.9293846  0.8940191
  0.7    0.0054878024  0.9210842  0.8815480
  0.7    0.0126775356  0.9187179  0.8779788
  0.7    0.0292867523  0.9145128  0.8716779
  0.7    0.0676561979  0.9171795  0.8756779
  0.7    0.1562945961  0.9061172  0.8590445
  0.7    0.3610607975  0.8183883  0.7274322
  0.8    0.0001926872  0.9471136  0.9205877
  0.8    0.0004451325  0.9430183  0.9144826
  0.8    0.0010283139  0.9389231  0.9083614
  0.8    0.0023755386  0.9322564  0.8983614
  0.8    0.0054878024  0.9268278  0.8901909
  0.8    0.0126775356  0.9200513  0.8799788
  0.8    0.0292867523  0.9145128  0.8716779
  0.8    0.0676561979  0.9214799  0.8821532
  0.8    0.1562945961  0.9063223  0.8593864
  0.8    0.3610607975  0.8086593  0.7129840
  0.9    0.0001926872  0.9484469  0.9225877
  0.9    0.0004451325  0.9456850  0.9184826
  0.9    0.0010283139  0.9403516  0.9104826
  0.9    0.0023755386  0.9375897  0.9063614
  0.9    0.0054878024  0.9294945  0.8941909
  0.9    0.0126775356  0.9187179  0.8779788
  0.9    0.0292867523  0.9173846  0.8759994
  0.9    0.0676561979  0.9202418  0.8802906
  0.9    0.1562945961  0.9133846  0.8699827
  0.9    0.3610607975  0.7859560  0.6807828
  1.0    0.0001926872  0.9416850  0.9124338
  1.0    0.0004451325  0.9470183  0.9204826
  1.0    0.0010283139  0.9510183  0.9264826
  1.0    0.0023755386  0.9429231  0.9143614
  1.0    0.0054878024  0.9389231  0.9083614
  1.0    0.0126775356  0.9173846  0.8759788
  1.0    0.0292867523  0.9188132  0.8781368
  1.0    0.0676561979  0.9133846  0.8699994
  1.0    0.1562945961  0.9080513  0.8619994
  1.0    0.3610607975  0.6373700  0.4590265

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 1 and lambda
 = 0.001028314.
en_model$bestTune
   alpha      lambda
93     1 0.001028314
# Coefficients of the fitted glmnet model could be inspected directly,
# but the printout is long, so it is left commented out:
# coef(en_model$finalModel)
# Plot cross-validated performance across the alpha/lambda tuning grid.
plot(en_model)

Note: \(\alpha = 1 \rightarrow\) LASSO.

Check Test Accuracy

# Elastic-net class predictions on the hold-out set, compared against
# the true varieties.
class_prediction <- predict(en_model, newdata = testing, type = "raw")
confusionMatrix(data = class_prediction, reference = testing$kernname)
Confusion Matrix and Statistics

          Reference
Prediction Canadian Kama Rosa
  Canadian       20    1    0
  Kama            1   20    0
  Rosa            0    0   21

Overall Statistics
                                        
               Accuracy : 0.9683        
                 95% CI : (0.89, 0.9961)
    No Information Rate : 0.3333        
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.9524        
 Mcnemar's Test P-Value : NA            

Statistics by Class:

                     Class: Canadian Class: Kama Class: Rosa
Sensitivity                   0.9524      0.9524      1.0000
Specificity                   0.9762      0.9762      1.0000
Pos Pred Value                0.9524      0.9524      1.0000
Neg Pred Value                0.9762      0.9762      1.0000
Prevalence                    0.3333      0.3333      0.3333
Detection Rate                0.3175      0.3175      0.3333
Detection Prevalence          0.3333      0.3333      0.3333
Balanced Accuracy             0.9643      0.9643      1.0000

Use resamples()

# Collect the three trained models' resampling results side by side.
# All three used the same seed and CV scheme, so the 50 resamples align.
model_list <- list(TR = tree_model, RF = rf_model, EN = en_model)
ANS <- resamples(model_list)
summary(ANS)

Call:
summary.resamples(object = ANS)

Models: TR, RF, EN 
Number of resamples: 50 

Accuracy 
        Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
TR 0.7333333 0.8666667 0.9285714 0.8994652 0.9333333    1    0
RF 0.7333333 0.9297619 0.9333333 0.9322418 1.0000000    1    0
EN 0.8000000 0.9333333 0.9333333 0.9510183 1.0000000    1    0

Kappa 
   Min.   1st Qu.    Median      Mean 3rd Qu. Max. NA's
TR  0.6 0.8000000 0.8918903 0.8490197     0.9    1    0
RF  0.6 0.8942308 0.9000000 0.8982536     1.0    1    0
EN  0.7 0.9000000 0.9000000 0.9264826     1.0    1    0
# Box-and-whisker plot of the resampled Accuracy/Kappa for each model.
bwplot(ANS)