library(MASS)
library(dplyr)

The following variable screening method is called stepwise regression. This type of model selection adds and drops variables one at a time until no single change improves the model any further. The criterion it uses to compare candidate models is the AIC (Akaike information criterion). A lower AIC indicates a better model: the value falls as the fit improves and rises with each extra variable, so a variable is kept only if it improves the fit by more than its penalty.
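As a rough illustration of how AIC-driven stepwise selection behaves, here is a minimal sketch on R’s built-in mtcars data. It is not the football dataset used in this report, and the variable names belong to mtcars. It starts from a full model, lets step() add and drop terms, and compares the AIC before and after.

```r
# Illustration only: built-in mtcars data, not the football dataset.
full <- lm(mpg ~ ., data = mtcars)

# direction = "both" lets step() drop a term or add back a dropped one at each
# iteration; trace = 0 suppresses the step-by-step log.
selected <- step(full, direction = "both", trace = 0)

# A lower AIC is better. step() only accepts a change that lowers the AIC, so
# the selected model can never have a higher AIC than the starting model.
# (step() tracks the criterion with extractAIC(), which for a linear model on
# a fixed dataset differs from AIC() only by a constant.)
AIC(full)
AIC(selected)

# Terms that survived the selection
attr(terms(selected), "term.labels")
```

The AIC values of the two models are only comparable because both were fit to the same rows of the same data.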

We wanted to predict a team’s post-season conference ranking from pre-season statistics, acquired from a large dataset being used for a senior thesis.

X_CFBRversionSEC <- X_CFBRversion %>%
  filter(SEC == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionSEC), direction = c("both"))
X_CFBRversionACC <- X_CFBRversion %>%
  filter(ACC == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionACC), direction = c("both"))
X_CFBRversionBigTen <- X_CFBRversion %>%
  filter(BigTen == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionBigTen), direction = c("both"))
X_CFBRversionPacTen <- X_CFBRversion %>%
  filter(PacTen == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionPacTen), direction = c("both"))
X_CFBRversionBigTwelve <- X_CFBRversion %>%
  filter(BigTwelve == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionBigTwelve), direction = c("both"))

Through this process, we found that stepwise selection retained different variables for different conferences. In other words, what it takes to be successful relative to the other teams in your conference depends on the conference. The variables retained for each conference were:

SEC: “Fr5star”, “Fr4star”, “So4star”, “Soavg”, “Jrnbrrecruits”, “Jr5star”, “Jr4star”, “Rssr5star”, “Rssr4star”, “z_lysagarin”, “coachexp_school”, “coachexp_total”

ACC: “FrNbrRecruits”, “Fr4star”, “Sonbrrecruits”, “Jrnbrrecruits”, “Jr5star”, “Jr4star”, “Jr3star”, “Jravg”, “Rssr4star”, “z_lysagarin”, “retoff”, “bowl”, “coachexp_total”

BigTen: “FrNbrRecruits”, “Fr3star”, “So4star”, “Jrnbrrecruits”, “Jr5star”, “Jr4star”, “Jr3star”, “Jravg”, “Srnbrrecruits”, “Sr3star”, “z_lysagarin”, “coachexp_total”

PacTen: “FrNbrRecruits”, “Fr4star”, “Sonbrrecruits”, “So4star”, “Soavg”, “Jrnbrrecruits”, “Jravg”, “Sr3star”, “Rssrnbrrecruits”, “Rssr5star”, “Rssr4star”, “Rssr3star”, “Rssravg”, “z_lysagarin”, “z_tyasagarin”, “retdef”, “qbret”, “coachexp_school”, “coachexp_total”, “Sravg”

BigTwelve: “FrNbrRecruits”, “Fr4star”, “Fr3star”, “Fravg”, “Sonbrrecruits”, “Soavg”, “Jrnbrrecruits”, “Jr5star”, “Srnbrrecruits”, “Rssrnbrrecruits”, “Rssravg”, “z_lysagarin”, “coachexp_school”, “coachexp_total”
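The five stepwise calls above differ only in the conference indicator used to filter the data, so the same screening could be written as a loop. The following is a minimal sketch on made-up data: the data frame toy, its columns x1, x2, x3 and rank, and the three-conference setup are all hypothetical stand-ins, not the real X_CFBRversion.

```r
# Toy stand-in for X_CFBRversion; every name and value here is made up.
set.seed(1)
n <- 150
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  SEC    = rep(c(1, 0, 0), each = n / 3),
  ACC    = rep(c(0, 1, 0), each = n / 3),
  BigTen = rep(c(0, 0, 1), each = n / 3)
)
toy$rank <- 7 + 2 * toy$x1 + rnorm(n)

# Build the full formula once from a vector of predictor names
predictors <- c("x1", "x2", "x3")
full_formula <- reformulate(predictors, response = "rank")

# Run the same stepwise screening separately for each conference
selected <- list()
for (conf in c("SEC", "ACC", "BigTen")) {
  conf_data <- toy[toy[[conf]] == 1, ]
  fit <- step(lm(full_formula, data = conf_data), direction = "both", trace = 0)
  selected[[conf]] <- attr(terms(fit), "term.labels")
}
selected
```

The sketch leaves the conference indicators out of the predictor list: after filtering to one conference they are constant, so they cannot help explain the ranking.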

Next, we subset the data for each conference so that it contained only the variables deemed important in the variable screening process (plus the team, year, ranking, and conference indicator columns).

This made it easier to write a function that takes a conference as a string (e.g. “SEC”) and a year, and outputs the predicted rankings for that conference in that year.

The function selects the data for the conference you specify and then splits it into two datasets: one that contains the data from every year except the year you specified (the training data), and one that contains only the data from the year you specified (the test data). The multiple linear regression model is fit on the training data, and that model is then used to predict the rankings in the test data.

SECsubset <- as.data.frame(X_CFBRversion[,c(1,2,3,7,8,14,16,18,19,20,31,32,38,46,47,50)])
 SECsubset<- SECsubset %>%
   filter(SEC == 1)
ACCsubset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,8,12,18,19,20,21,22,32,38,41,44,47,52)])
 ACCsubset<- ACCsubset %>%
   filter(ACC == 1)
PAC10subset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,8,12,14,16,18,22,27,30,31,32,33,34,38,40,42,43,46,47,28,53)]) 
PAC10subset<- PAC10subset %>%
   filter(PacTen == 1)
Big10subset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,9,14,18,19,20,21,22,24,27,38,47,49)])
 Big10subset<- Big10subset %>%
   filter(BigTen == 1)
Big12subset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,8,9,10,12,16,18,19,24,30,34,38,46,47,51)])
 Big12subset<- Big12subset %>%
   filter(BigTwelve == 1)
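# Fit a conference-specific linear model on every season except y, then
# predict the rankings for season y. x is the conference name as a string.
predRank<- function(x,y)
{
  if(x=="SEC") {
    dat<-subset(SECsubset, SECsubset$Year != y)
    newdat<-subset(SECsubset, SECsubset$Year == y)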
    colnames(dat) <- c("Team","Year","EPSNrank","Fr5star","Fr4star","So4star","Soavg","Jrnbrrecruits","Jr5star","Jr4star","Rssr5star","Rssr4star","z_lysagarin","coachexp_school","coachexp_total")
    sw<- lm(EPSNrank ~ Fr5star + Fr4star + So4star + Soavg + Jrnbrrecruits + 
    Jr5star + Jr4star + Rssr5star + Rssr4star + z_lysagarin + 
    coachexp_school + coachexp_total, data = dat)
    preds<-predict(sw, newdata = newdat)
    predset<-t(rbind(newdat$Team,newdat$EPSNrank,preds))
     preddf<-as.data.frame(predset)
  }
  if(x=="ACC") {
    dat<-subset(ACCsubset, ACCsubset$Year != y)
    newdat<-subset(ACCsubset, ACCsubset$Year == y)
    colnames(dat) <- c("Team","Year","EPSNrank", "FrNbrRecruits" , "Fr4star" , "Sonbrrecruits", "Jrnbrrecruits", "Jr5star", "Jr4star", "Jr3star", "Jravg", "Rssr4star", "z_lysagarin","retoff", "bowl" ,"coachexp_total")
  
    fw<- lm(EPSNrank ~ FrNbrRecruits + Fr4star + Sonbrrecruits + Jrnbrrecruits + 
    Jr5star + Jr4star + Jr3star + Jravg + Rssr4star + z_lysagarin + 
    retoff + bowl + coachexp_total, data = dat)
    
  preds<-predict(fw, newdata = newdat)
  predset<-t(rbind(newdat$Team, newdat$EPSNrank,preds))
   preddf<-as.data.frame(predset)
  }
  if(x=="BigTen"){
    dat<-subset(Big10subset, Big10subset$Year != y)
    newdat<-subset(Big10subset, Big10subset$Year == y)
  colnames(dat) <- c("Team","Year","EPSNrank","FrNbrRecruits","Fr3star","So4star","Jrnbrrecruits","Jr5star","Jr4star","Jr3star","Jravg","Srnbrrecruits","Sr3star","z_lysagarin","coachexp_total")
  bw <- lm(EPSNrank ~ FrNbrRecruits + Fr3star + So4star + Jrnbrrecruits + 
    Jr5star + Jr4star + Jr3star + Jravg + Srnbrrecruits + Sr3star + 
    z_lysagarin + coachexp_total, data = dat)
  preds<-predict(bw, newdata = newdat)
  predset<-t(rbind(newdat$Team,newdat$EPSNrank,preds))
  preddf<-as.data.frame(predset)
  }
  if(x=="PacTen"){
    dat<-subset(PAC10subset, PAC10subset$Year != y)
    newdat<-subset(PAC10subset, PAC10subset$Year == y)
  colnames(dat) <- c("Team","Year","EPSNrank", "FrNbrRecruits" , "Fr4star" , "Sonbrrecruits", "So4star", "Soavg", "Jrnbrrecruits" , "Jravg" , "Sr3star" , "Rssrnbrrecruits" , "Rssr5star" , "Rssr4star" , "Rssr3star" , "Rssravg" , "z_lysagarin" , "z_tyasagarin" , "retdef" , "qbret" , "coachexp_school" , "coachexp_total" , 
    "Sravg")
  bw <- lm(EPSNrank ~ FrNbrRecruits + Fr4star + Sonbrrecruits + So4star + 
    Soavg + Jrnbrrecruits + Jravg + Sr3star + Rssrnbrrecruits + 
    Rssr5star + Rssr4star + Rssr3star + Rssravg + z_lysagarin + 
    z_tyasagarin + retdef + qbret + coachexp_school + coachexp_total + 
    Sravg, data = dat)
  preds<-predict(bw, newdata = newdat)
  predset<-t(rbind(newdat$Team,newdat$EPSNrank,preds))
  preddf<-as.data.frame(predset)
  }
  if(x =="BigTwelve"){
    dat<-subset(Big12subset, Big12subset$Year != y)
    newdat<-subset(Big12subset, Big12subset$Year == y)
    colnames(dat) <- c("Team","Year","EPSNrank","FrNbrRecruits", "Fr4star", "Fr3star", "Fravg", "Sonbrrecruits","Soavg" , "Jrnbrrecruits" , "Jr5star" , "Srnbrrecruits" , "Rssrnbrrecruits" , "Rssravg" , "z_lysagarin" , "coachexp_school" , "coachexp_total")
  
    fw<-lm(EPSNrank ~ FrNbrRecruits + Fr4star + Fr3star + Fravg + Sonbrrecruits + 
    Soavg + Jrnbrrecruits + Jr5star + Srnbrrecruits + Rssrnbrrecruits + 
    Rssravg + z_lysagarin + coachexp_school + coachexp_total, data = dat)
    
  preds<-predict(fw, newdata = newdat)
  predset<-t(rbind(newdat$Team, newdat$EPSNrank,preds))
   preddf<-as.data.frame(predset)
  }
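  # t(rbind(...)) coerced every column to text, so convert the actual rank (V2)
  # and the predictions back to numbers
  preddf$V2<-as.numeric(as.character(preddf$V2))
  preddf$preds<-as.numeric(as.character(preddf$preds))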
  
 return(preddf)
}
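The same hold-out-one-year idea can be written more compactly by storing one formula per conference in a named list instead of repeating an if block for each conference. This is a minimal sketch under stated assumptions: the toy data, the formulas list, the conference name “ToyConf”, and the function name pred_rank_sketch are all hypothetical and are not part of the original analysis.

```r
# Toy data; team names, years, and predictors are invented for illustration.
set.seed(42)
toy <- data.frame(
  Team = rep(paste0("Team", 1:6), times = 4),
  Year = rep(2015:2018, each = 6),
  x1 = rnorm(24),
  x2 = rnorm(24)
)
toy$Rank <- 3.5 + 1.5 * toy$x1 - toy$x2 + rnorm(24, sd = 0.5)

# One formula per conference would go in a named list like this one
# (hypothetical entry).
formulas <- list(ToyConf = Rank ~ x1 + x2)

pred_rank_sketch <- function(data, conference, year) {
  if (!conference %in% names(formulas)) {
    stop("Unknown conference: ", conference)
  }
  train <- data[data$Year != year, ]  # every season except `year`
  test  <- data[data$Year == year, ]  # only the season being predicted
  fit <- lm(formulas[[conference]], data = train)
  # data.frame() keeps numeric columns numeric, so no conversion is needed later
  data.frame(
    Team   = test$Team,
    actual = test$Rank,
    preds  = unname(predict(fit, newdata = test))
  )
}

out <- pred_rank_sketch(toy, "ToyConf", 2018)
out[order(out$preds), ]
```

Stopping on an unrecognized conference name gives a clear error message instead of failing later on a missing object.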

Next, we predicted the conference rankings for the 2018 season using the function we created above. The results are below: V1 is the team, V2 is the actual ranking (EPSNrank), and preds is the model’s prediction. One limitation is that a linear model predicts continuous values rather than integer ranks. Therefore, we simply ordered the predictions from least to greatest, as you can see below.

SEC<-predRank("SEC",2018)
ACC<-predRank("ACC",2018)
BigTen<-predRank("BigTen",2018)
PacTen<-predRank("PacTen",2018)
BigTwelve<-predRank("BigTwelve",2018)
SEC<-SEC[order(SEC$preds),]
SEC
                   V1 V2      preds
149           Georgia  2  0.7483953
151               LSU  4  3.8811658
145           Alabama  1  5.6032631
150          Kentucky  5  5.8993797
157         Texas A&M  6  7.1220021
154          Ole Miss 13  7.3828862
155    South Carolina  9  7.4111542
147            Auburn 10  7.8791257
153          Missouri  7  8.3805134
152 Mississippi State  8  9.0233377
156         Tennessee 12  9.3217204
146          Arkansas 14 10.2480061
158        Vanderbilt 11 10.4361920
148           Florida  3 11.4028449
ACC<-ACC[order(ACC$preds),]
ACC
                V1 V2     preds
149       Miami-FL  8  1.834273
146  Florida State 12  1.880734
155  Virginia Tech  9  3.276114
150       NC State  4  4.534099
148     Louisville 14  4.671640
156    Wake Forest 11  4.825418
154       Virginia  7  4.908386
144        Clemson  1  4.975559
143 Boston College  6  5.566071
145           Duke 10  8.241210
147   Georgia Tech  5  9.086959
153       Syracuse  2  9.579264
151 North Carolina 13 10.772782
152     Pittsburgh  3 12.379287
BigTen<-BigTen[order(BigTen$preds),]
BigTen
                V1 V2      preds
147     Penn State  4 -0.4226229
142 Michigan State  7  2.9731629
146     Ohio State  1  4.0818182
141       Michigan  2  5.3926006
139           Iowa  5  5.5232166
144       Nebraska 11  7.4059580
148         Purdue  8  8.8154271
145   Northwestern  3  8.9978007
143      Minnesota  9  9.5228485
137       Illinois 13  9.8637852
138        Indiana 12 10.1719705
140       Maryland 10 10.5101409
149        Rutgers 14 12.1588026
150      Wisconsin  6         NA
PacTen<-PacTen[order(PacTen$preds),]
PacTen
                V1 V2    preds
135     Washington  2 4.498936
126  Arizona State  6 5.861611
125        Arizona  8 6.033062
131   Southern Cal  9 6.131612
127     California  7 6.304335
132       Stanford  4 7.533845
129         Oregon  5 7.898436
128       Colorado 11 8.350038
130   Oregon State 12 8.539411
136 Washington St.  1 8.610382
133           UCLA 10 8.979491
134           Utah  3 9.280296
BigTwelve<-BigTwelve[order(BigTwelve$preds),]
BigTwelve
                V1 V2     preds
123       Oklahoma  1  3.572277
122   Kansas State  9  3.915932
124 Oklahoma State  7  4.489035
126          Texas  2  4.738465
120     Iowa State  4  5.826364
128  West Virginia  3  6.952349
127     Texas Tech  8  7.402782
125            TCU  6  8.590376
119         Baylor  5  9.810153
121         Kansas 10 10.416551
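If integer ranks are wanted instead of the raw continuous predictions, base R’s rank() can convert them. The sketch below reuses four of the Big Ten predictions from the output above. The choices of ties.method = "min" and na.last = "keep" are assumptions made for illustration, not something the original analysis did.

```r
# Four Big Ten predictions copied from the output above (Wisconsin was NA)
preds <- c(
  "Penn State"     = -0.4226229,
  "Michigan State" =  2.9731629,
  "Ohio State"     =  4.0818182,
  "Wisconsin"      =  NA
)

# Smallest prediction gets rank 1; na.last = "keep" leaves a missing
# prediction as NA instead of silently ranking it last
predicted_rank <- rank(preds, ties.method = "min", na.last = "keep")
predicted_rank
```

Keeping the NA visible makes it obvious that a team has no prediction, rather than implying the model ranked it last.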

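Our predictions weren’t very accurate. However, the team predicted to rank first usually finished near the top: in four of the five conferences it finished in the actual top five (the exception is the ACC, where Miami-FL finished 8th).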
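One way to put a number on accuracy is to compare the actual and predicted orderings directly. The sketch below does this for the SEC table above using Spearman’s rank correlation and the mean absolute difference in rank. Both metrics are suggestions made here for illustration and were not part of the original analysis.

```r
# Actual ranks (V2) and predictions (preds) copied from the SEC output above,
# in the same row order
sec <- data.frame(
  actual = c(2, 4, 1, 5, 6, 13, 9, 10, 7, 8, 12, 14, 11, 3),
  preds  = c(0.7483953, 3.8811658, 5.6032631, 5.8993797, 7.1220021,
             7.3828862, 7.4111542, 7.8791257, 8.3805134, 9.0233377,
             9.3217204, 10.2480061, 10.4361920, 11.4028449)
)

# Spearman correlation compares orderings only: 1 means the predicted order
# matches the actual order exactly, 0 means no monotone relationship
rho <- cor(sec$actual, sec$preds, method = "spearman")

# Average number of places by which the predicted rank misses the actual rank
mae <- mean(abs(sec$actual - rank(sec$preds)))

rho
mae
```

Running the same two lines on the other conference tables would allow a like-for-like comparison across conferences.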

We also used another method of prediction: decision trees. We wrote a very similar function, treePredRank, to predict a given season’s conference rankings for each conference. Like the function above, it takes a conference as a string (e.g. “SEC”) and a year and predicts that conference’s rankings for that year, but it is based on a decision tree model.

A new decision tree is created for each conference. The process is essentially the same as the one above, except that decision trees are used to predict.

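One thing to note about the tree diagram is that a decision tree assigns the same prediction to every team that ends up in the same terminal node, so it can only produce a handful of distinct values. Therefore a lot of “ties” show up in the predictions, as you will see below.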
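To see why ties arise, here is a minimal sketch that fits a regression tree to made-up data and counts the distinct predicted values. It assumes the rpart package. The source does not say which tree package treePredRank used, so rpart, the toy data, and all variable names here are assumptions.

```r
library(rpart)  # assumption: the original tree package is not stated

# Toy data; predictors and response are invented for illustration.
set.seed(7)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$Rank <- 7 + 3 * toy$x1 + rnorm(60)

# method = "anova" fits a regression tree: each terminal node predicts the
# mean response of the training rows that fall into it
tree <- rpart(Rank ~ x1 + x2, data = toy, method = "anova")
tree_preds <- predict(tree, newdata = toy)

# Number of terminal nodes (leaves) in the fitted tree
n_leaves <- sum(tree$frame$var == "<leaf>")

# There can be at most one distinct prediction per leaf, so many rows share
# the same predicted value
n_leaves
table(round(tree_preds, 2))
```

Ordering teams by such predictions leaves every team within a leaf tied, which is the behaviour described above.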

SECTree<-treePredRank("SEC",2018)
ACCTree<-treePredRank("ACC",2018)

BigTenTree<-treePredRank("BigTen",2018)

PacTenTree<-treePredRank("PacTen",2018)

BigTwelveTree<-treePredRank("BigTwelve",2018)

These predictions are also not very accurate. However, the ties are interesting because they show how teams that were predicted to perform the same ended up differing in the post-season.

Overall, pre-season statistics don’t seem to be a very good predictor of post-season conference rankings, whether you use multiple linear regression with variable screening or decision trees.

An important result was that different conferences yield different predictors of success.

