---
title: "Machine Learning Final Project"
author: "Bradley Bush, Sandy Tanwisuth, Jesse St. Amand"
date: "May 4, 2017"
output:
  pdf_document: default
  html_document: null
header-includes:
  \usepackage{float}
  \usepackage{graphicx}
---

```{r echo=FALSE,warning=FALSE}
 library(knitr)
  opts_chunk$set(fig.path='figure/graphics-', 
                 cache.path='cache/graphics-', 
                 fig.align='center',
                 external=TRUE,
                 echo=TRUE,
                 warning=FALSE,
                 fig.pos='H'
                )
  a4width<- 8.3
  a4height<- 11.7
```

# Introduction

Syngenta is one of the largest seed producers in the world. They currently use a multi-year process to breed, test, and select varieties to sell commercially. They are interested in improving their selection process and have partnered with Idea Connection to host a data mining competition (see: https://www.ideaconnection.com/Syngenta-AI-Challenge/). The purpose of the competition is to give teams a chance to use AI and data mining techniques (along with their creativity) to come up with ways to improve the selection process and allow Syngenta to 'grow more with less.'

We have decided to use the data from this competition for our project. We are attempting to predict soybean yield from environmental factors such as the amount of sun and precipitation the soybean plant gets during its growing season and from soil properties such as the amount of sand, silt, or organic matter present in the soil at the testing site. A full description of the variables used is given below.


# The Data

Here is a description of the variables:

Variable Name:  | Variable Description:
---------------| ---------------------------------------------------------------------------
`YIELD`         | Variety yield in bushels per acre (this is an average because there may be multiple experiments of the same variety at the same location, i.e., at different parts of the test site)
`AREA`          | The number of acres growing soybeans in the segment of about 36 sq miles around the testing location. The information was obtained from http://nassgeodata.gmu.edu/CropScape (NASS USDA, 2014)
`IRRIGATION`    | The number of acres with irrigation in the segment of about 36 sq miles around the testing location. The information was obtained from http://nassgeodata.gmu.edu/CropScape (NASS USDA, 2014)
`AWC_100CM`     | The available soil water capacity (volumetric fraction) until wilting point. This information was summarized for the first 100 cm.
`TEMP`          | Temperature accumulation during the season (degrees Celsius): the sum of average daily temperatures at the test site from 1 Apr to 31 Oct.
`PREC`          | Precipitation accumulation during the season (millimeters): the sum of average daily precipitation at the test site from 1 Apr to 31 Oct.
`RAD`           | Solar radiation accumulation during the season (watts per square meter): the sum of average daily solar radiation at the test site from 1 Apr to 31 Oct.
`SAND_TOP`      | Percentages of sand particles according to size (small, less than 0.002 mm; medium, between 0.002 and 0.05 mm; and large, more than 0.05 mm, respectively). These proportions define soil texture. This information was summarized for the first 30 cm.
`SILT_TOP`      | Percentages of silt particles according to size (small, less than 0.002 mm; medium, between 0.002 and 0.05 mm; and large, more than 0.05 mm, respectively). These proportions define soil texture. This information was summarized for the first 30 cm.
`CLAY`          | Percentages of clay particles according to size (small, less than 0.002 mm; medium, between 0.002 and 0.05 mm; and large, more than 0.05 mm, respectively). These proportions define soil texture. This information was summarized for the first 30 cm.
`PH`            | The negative log of the H+ concentration in the soil. Acidic soils have low pH values and high H+ concentration, and alkaline soils have high pH values. It impacts soil chemical reactions and the ability of the soil to supply nutrients to plants. The optimum pH range is between 6.5 and 7. This information was summarized for the first 30 cm of the soil profile.
`ORGANIC.MATTER`| The percentage of the soil consisting of plant and animal residues at various stages of decomposition, soil organisms and their byproducts. It impacts soil chemical reactions and soil structure. It is an important indicator of the ability of the soil to supply nutrient and water to crops. This information was summarized for the first 30 cm of the soil profile.
`CEC`           | The Cation Exchange Capacity (cmol kg-1) quantifies the amount of negative charge in the soil. It impacts soil chemical reactions and the ability of the soil to supply nutrient to plants. It is often associated with clay and organic matter content. This information was summarized for the first 30 cm of the soil profile.
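Before modeling, it helps to eyeball the ranges and distributions of these variables. A minimal sketch, assuming the merged `SoyData.csv` file produced in the processing step:

```{r eval=FALSE}
sdat <- read.csv("SoyData.csv", header = TRUE)

# five-number summaries for a few key variables
summary(sdat[, c("YIELD", "TEMP", "PREC", "RAD", "PH")])

# distribution of the response
hist(sdat$YIELD, main = "Soybean yield", xlab = "bushels per acre")
```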




## Processing Data

Much of our time was spent on this part of the project. The data were given to us in two files: the first contained the soil and environmental data, and the second contained the information about the soybean varieties. We first combined the datasets by taking each observation (i.e., a given soybean variety in a given year) and appending the appropriate weather and soil data, so that all of the data were in one dataframe. Once combined, the data have 172,057 observations and 66 variables. Not all of these variables were useful (for example, years of weather data for which we did not have soybean yields), so we created another dataset with only the variables we wanted to use. When running our models we split the data into three sets: 50% training data, 25% validation data, and 25% test data.

```{r  include=FALSE, eval=FALSE}
# read in the soy training data (minus geographic information)
sdat1 = read.csv("TRAINING_DATASET.csv", header = TRUE)

n = nrow(sdat1) # find number of rows in training data `sdat1`

# read in geographic data for all locations in training data `sdat1` 
sdat2 = read.csv("Geographic_information_updated.csv", header = TRUE)

# merge the training data with the geographic information for each observation on variable `LOCATION`
sdat.merged = merge(sdat1, sdat2, by="LOCATION")

# delete weather data for years 2001-2008 (as we don't have varieties planted in those years)
sdat.merged1 = sdat.merged[,c(1:15,24:29,38:43,52:64)]
wdat = sdat.merged1[,16:33]   # save weather data before deleting it below
sdat = sdat.merged1[,c(1:15,34:40)]  # take out weather data to be sorted and added later according to year

# create temp vector with three columns for temperature, precipitation, and solar radiation (`TEMP`, `PREC`, and `RAD` respectively)

tempvec = data.frame(matrix(nrow = n, ncol = 3))  # create temp vec
colnames(tempvec) = c("TEMP","PREC","RAD")   # add column names



for(i in 1:n) {

  temp.year = sdat$YEAR[i]  # get year in ith row
  temp.year = temp.year%%2000  # this returns remainder of temp.year/2000 ie if temp.year = 2011, then this returns 11
  
  if (temp.year==9) {  # if year is 2009, then add a '0' to column name
    
    index = which(colnames(wdat)==paste("TEMP_0", temp.year, sep="") | colnames(wdat)==paste("PREC_0", temp.year, sep="") | colnames(wdat)==paste("RAD_0", temp.year, sep=""))  # find column indices for temp, prec, and rad for the year in the i^th row
  
  tempvec[i,] = wdat[i,index]  # copy the temp, prec, and rad values from `wdat` (the 3 column indices saved in `index`) into row i of `tempvec`
  
  } else {
  
  
  index = which(colnames(wdat)==paste("TEMP_", temp.year, sep="") | colnames(wdat)==paste("PREC_", temp.year, sep="") | colnames(wdat)==paste("RAD_", temp.year, sep=""))  # find column indices for temp, prec, and rad for the year in the i^th row
  
  tempvec[i,] = wdat[i,index]  # copy the temp, prec, and rad values from `wdat` (the 3 column indices saved in `index`) into row i of `tempvec`
  
  }
  
}

sdat.all = cbind(sdat,tempvec)  # add `tempvec` to the right side of `sdat`  

write.csv(sdat.all, file = "SoyData.csv") # save `sdat` as a SoyData.csv

sdat = sdat.all[,c("YIELD","AREA","IRRIGATION","CEC","PH","ORGANIC.MATTER","CLAY","SILT_TOP","SAND_TOP","AWC_100CM","TEMP","PREC","RAD")]  # pull out only the variables that have information ie `repno` is the rep number and is useless as a variable 


```


# Modeling the Data: 

We tried three different methods from our Machine Learning class: Random Forests, Deep Neural Nets, and kNN. We also mention Linear Regression (as a reference baseline) and Boosting (as a method to explore in the future). We tuned Random Forests, Neural Nets, and kNN, and discuss each method with its corresponding results below.

## Baseline: Generalized Linear Models

To give us a baseline for out-of-sample performance, we fit a linear regression model using the `glm` function. We used stepwise elimination (via the `step` function) to select our model. This gave us our baseline RMSE of $13.0193$ bushels per acre. Our baseline model is given below.

$$Yield= 32.61 + 0.00056*area + 0.00074*irrigation + 2.330*organic.matter + 0.325*silt.top - 0.449*cec + \varepsilon$$
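As a quick sanity check, the baseline equation can be evaluated by hand. The predictor values below are invented for illustration only:

```{r eval=FALSE}
# hypothetical site (illustrative values, not real observations)
area <- 10000; irrigation <- 500
organic.matter <- 3.0; silt.top <- 40; cec <- 15

yield.hat <- 32.61 + 0.00056*area + 0.00074*irrigation +
  2.330*organic.matter + 0.325*silt.top - 0.449*cec
yield.hat  # ~51.8 bushels per acre
```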


```{r warning=FALSE,include=FALSE, eval=FALSE}
##Read in the data file using a library called "readr".

#read in text file using library readr so that all columns are in the right format
library(readr)
#read in data to a variable called SoyData
SoyData <- read_csv("SoyData.csv")
```


```{r resampling, warning=FALSE,include=FALSE, eval=FALSE}
##In this section, we resample the data to 1000 observations and check the normality assumption. 

#use random sampling to create a training set called Soy1000
Soy1000 <- SoyData[sample(nrow(SoyData), 1000), ]
#attach the data frame so columns can be referenced by name
attach(SoyData)

#check column-names
colnamesSoy <- colnames(SoyData)
classSoy <- class(SoyData)

#check normality
shapiro.test(Soy1000$YIELD)
```



```{r glm, warning=FALSE,include=FALSE, eval=FALSE}

#create additive model
model  <- glm( YIELD ~ AREA + IRRIGATION + CEC + PH + ORGANIC.MATTER + CLAY + SILT_TOP +
                 AWC_100CM + TEMP + SAND_TOP, data=Soy1000)

#plot the models
plot(model)
#summary model
summary.mod <- summary(model)
```


```{r step, warning=FALSE,include=FALSE, eval=FALSE}
##In this section we use the stepwise method to eliminate insignificant variables.

#stepwise eliminations to eliminate insignificant variables
step(model)
```


```{r afterstep, warning=FALSE,include=FALSE, eval=FALSE}
##We recreated another additive model after the stepwise method.

#create new model after stepwise
new.model  <- glm( YIELD ~ AREA + IRRIGATION + ORGANIC.MATTER + SILT_TOP , data=Soy1000)
summarynew <- summary(new.model)

#re-plot the new model
plot(new.model)

#create a new data set using random sample for testing
Soy1000Test <- SoyData[sample(nrow(SoyData), 1000), ]

#calculate rmse on the held-out sample
score <- predict(new.model, Soy1000Test) 
actual <- Soy1000Test$YIELD
rmse <- sqrt(mean((actual - score)^2))
rmse

##The RMSE looked promising but not as good as other methods.
```



## Random Forests

With Random Forests we tried to optimize using a spread of values for `mtry` and `ntree` as well as using an optimization function called `tuneRF` from the `randomForest` package, which optimizes over `mtry` in a more thorough fashion. Our results didn't vary much over `mtry` values and were fairly consistent over `ntree` values.


We have included a table of values along with a graph below. Our final Random Forests model choice had an OOB RMSE of $7.9042$ bushels per acre with `mtry` set to 6 and `ntree` set to 1500.


![ntree values of 500, 1000, 1500, and 2000 are used with mtry values of 3, 4, 6, 9 and 12. The results do not vary much for any of the settings.](C:\Users\bra2194372\Dropbox\Data\Rf.RMSEvsMTRY.PNG)

\pagebreak

mtry:  | ntree=500 | ntree=1000 | ntree=1500 | ntree=2000 
-----| ----------| ----------| ----------| ----------|
3 | 7.926718 | 7.926793 | 7.926908 | 7.925924
4 | 7.909572 | 7.909412 | 7.909228 | 7.909150
6 | 7.904852 | 7.904379 | 7.904189 | 7.904188
9 | 7.904154 | 7.904245 | 7.904059 | 7.904156
12| 7.904267 | 7.904182 | 7.904173 | 7.904311
Table 1: OOB RMSE (bushels per acre) for each combination of `ntree` and `mtry`
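The OOB errors behind a table like Table 1 can be read directly off a fitted `randomForest` object. A minimal sketch of the mechanics on a built-in dataset (not the soy data):

```{r eval=FALSE}
library(randomForest)

fit <- randomForest(mpg ~ ., data = mtcars, mtry = 3, ntree = 500)

# fit$mse holds the OOB MSE after 1..ntree trees; the last entry is
# the OOB MSE of the full forest, so its square root is the OOB RMSE
sqrt(tail(fit$mse, 1))
```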

```{r include=FALSE, eval=FALSE}

# Random Forests!

#load libraries
library(randomForest)
library(gbm) #boosting

#shortcut to saved `SoyData.csv`
sdat = read.csv("SoyData.csv", header = TRUE)

# create training, validation, and test data
set.seed(14) # my birthday
n=nrow(sdat)  # find number of obs in data

n1=floor(n/2)  # find half of number of obs (rounded) in data (for training data)
n2=floor(n/4)  # find fourth of number of obs (rounded) in data (for validation data)
n3=n-n1-n2   # find fourth (ie 1 - 1/2 - 1/4 rounded ) of number of obs (rounded) in data (for test data)

ii = sample(1:n,n) # randomly reorders n elements
sdat.train = sdat[ii[1:n1],]
sdat.val = sdat[ii[(n1+1):(n1+n2)],]
sdat.test = sdat[ii[(n1+n2+1):n],]

# we fit a model on the training data `sdat.train` using random forests, then predict on the validation data `sdat.val`
# Note: mtry is the number of randomly chosen variables considered at each split when building the trees

# Note: we use mtry=floor(sqrt(ncol(sdat))) = 3 as our default value for mtry  
rf.fit = randomForest(YIELD~.,data=sdat.train, mtry=3, ntree=500)  # fit random forests model on training data
rfvalpred = predict(rf.fit, newdata=sdat.val)  # predict using model `rf.fit` on validation data

# let's try a larger number of trees
rf.fit1 = randomForest(YIELD~.,data=sdat.train, mtry=9, ntree=2000)  # fit random forests model on training data
rfvalpred = predict(rf.fit1, newdata=sdat.val)  # predict using model `rf.fit` on validation data


rf.fit2 = randomForest(YIELD~.,data=sdat.train, mtry=12, ntree=2000)  # fit random forests model on training data
rfvalpred = predict(rf.fit2, newdata=sdat.val)  # predict using model `rf.fit` on validation data

# mean(rf.fit2$mse)  # average OOB MSE over the 2000 trees; used for the mtry=12 row below

bestmtry.df = as.data.frame(bestmtry) # make them all dataframes
bestmtry2.df = as.data.frame(bestmtry2)
bestmtry3.df = as.data.frame(bestmtry3)
bestmtry4.df = as.data.frame(bestmtry4)

bestmtry4.1.df = rbind(bestmtry4,c(12,mean(rf.fit2$mse))) # add the MSE

rf.results= cbind(bestmtry.df[,1],sqrt(bestmtry.df[,2]),sqrt(bestmtry2.df[,2]),sqrt(bestmtry3.df[,2]),sqrt(bestmtry4.1.df[,2]))
row.names(rf.results)= c("3", "4","6","9", "12")
colnames(rf.results)= c("mtry", "500", "1000", "1500", "2000")

# plot results
plot(rf.results[,1],rf.results[,2],type="n",xlab="mtry",ylab="RMSE",cex.lab=1.5, main = "RMSE vs mtry")
for(i in 2:5) lines(rf.results[,1],rf.results[,i],pch=18,col=i,lty=2,lwd=3) #plot each ntree curve
leg.text = c("500", "1000","1500","2000")
legend("topright",leg.text, col=c("red","green","blue","cyan"),lty = 1, lwd=3)


# Cross validation with Random Forests using function `rfcv()`

rf.fit.cv = rfcv(sdat.train[,-1], sdat.train[,1], cv.fold=10)


# tune Random Forests with the tuneRF() function in package randomForest

# Algorithm Tune (tuneRF)
set.seed(14)
bestmtry <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=500)
print(bestmtry)

set.seed(14)
bestmtry2 <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=1000)
print(bestmtry2)

set.seed(14)
bestmtry3 <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=1500)
print(bestmtry3)

set.seed(14)
bestmtry4 <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=2000)
print(bestmtry4)

# plot graph of mtry options

bestmtry.df = as.data.frame(bestmtry)  #make it a dataframe
windows()
plot(bestmtry.df$mtry,bestmtry.df$OOBError, xlab = "Mtry", ylab = "OOBError", type = "o", main = "Mtry vs MSE (ie OOBError), ntree = 500" )



```


## KNN

We used the R package `kknn` for our kNN models. Tuning these models was extremely time consuming (read: computationally expensive), but they ended up performing surprisingly well. Below is a table of the RMSE values for various values of k.
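A minimal sketch of the `kknn` mechanics on a built-in dataset (not the soy data); the rectangular kernel gives plain, unweighted kNN averaging:

```{r eval=FALSE}
library(kknn)

# hold out the last 10 rows of mtcars as a toy test set
fit <- kknn(mpg ~ ., train = mtcars[1:22, ], test = mtcars[23:32, ],
            k = 5, kernel = "rectangular")

# RMSE of the held-out predictions
sqrt(mean((mtcars$mpg[23:32] - fit$fitted.values)^2))
```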

|k | RMSE |
|--|------|
| 5 |8.592175 |
| 8 |8.341477 |
| 11 |8.217457 |
| 14 |8.172575 |
| 17 |8.126609 |
| 20 |8.102771 |
| 23 |8.089667 |
| 26 |8.076027 |
| 29 |8.074170 |
| 32 |8.067071 |
| 35 |8.062278 |
| 38 |8.055297 |
| 41 |8.054643 |
| 44 |8.055504 |
| 47 |8.057311 |
| 50 |8.056076 |
| 53 |8.062764 |
| 56 |8.067178 |
| 59 |8.066341 |
| 62 |8.069128 |
| 65 |8.075273 |
| 68 |8.076966 |

Table 2: k values and the RMSE of each model, as plotted in the cross-validation figures below. The best RMSE is ~8.055, found at k = 41.

\pagebreak

![Preliminary plot of k versus cvmean, comparing the parameter k at values between 5 and 68 with the RMSE of a single model. This was used to find the approximate range of appropriate k values before moving on to the more computationally expensive cross-validation below.](C:\Users\bra2194372\Dropbox\Data\cvmean_kv.JPG)

\pagebreak

![Plot of k versus cvmean, comparing the parameter k at values between 5 and 68 with the RMSE of each model (estimated by 10-fold cross validation). Each colored line represents a different cross-validation split of the training data; averaging these together to produce the black line reduces the variance of the estimate.](C:\Users\bra2194372\Dropbox\Data\cvmean_k.JPG)

\pagebreak

![Pairwise comparison of kNN at k = 41 against a linear model. The greater linearity in the y versus kNN41 plots compared with the y versus linear plots gives a visual indication that the kNN model is a better predictor for these data than the linear model.](C:\Users\bra2194372\Dropbox\Data\fmat.JPG)



```{r include=FALSE, eval=FALSE}
library(kknn)
source("mlfuns.R")  # provides docvknn() used for cross validation below
sdat.all <- read.csv("SoyData2.csv", header = TRUE)
```

```{r  include=FALSE, eval=FALSE}
# pull out only the variables that have information ie `repno` is the rep number and is useless as a variable 
sdat=sdat.all[,c("YIELD","AREA","IRRIGATION","CEC","PH","ORGANIC.MATTER","CLAY","SILT_TOP","SAND_TOP","AWC_100CM","TEMP","PREC","RAD")] 

# create training, validation, and test data
set.seed(14) # my birthday
n=nrow(sdat)  # find number of obs in data

n1=floor(n/2)  # find half of number of obs (rounded) in data (for training data)
n2=floor(n/4)  # find fourth of number of obs (rounded) in data (for validation data)
n3=n-n1-n2   # find fourth (ie 1 - 1/2 - 1/4 rounded ) of number of obs (rounded) in data (for test data)

ii = sample(1:n,n) # randomly reorders n elements
trainDf = sdat[ii[1:n1],]
valDf = sdat[ii[(n1+1):(n1+n2)],]
train2Df = rbind(trainDf,valDf)
testDf = sdat[ii[(n1+n2+1):n],]

```

```{r include=FALSE, eval=FALSE}
y = trainDf[,1]
x = trainDf[,2:12]
print(dim(x))  # check dimensions of the predictor matrix
mmsc=function(x) {return((x-min(x))/(max(x)-min(x)))}
xs = apply(x,2,mmsc)
#plot y vs each x
par(mfrow=c(1,2)) #two plot frames

```
```{r  include=FALSE, eval=FALSE}
par(mfrow=c(1,1))
set.seed(99)
kv = seq(5,70,3) #k values to try
n = length(y)
cvtemp = docvknn(xs,y,kv,nfold=10)
cvtemp = sqrt(cvtemp/n) #docvknn returns sum of squares
plot(kv,cvtemp)
```
```{r  include=FALSE, eval=FALSE}
#run cross val several times
set.seed(99)
cvmean = rep(0,length(kv)) #will keep average rmse here
ndocv = 10 #number of CV splits to try
n=length(y)
cvmat = matrix(0,length(kv),ndocv) #keep results for each split
for(i in 1:ndocv) {
  cvtemp = docvknn(xs,y,kv,nfold=10)
  cvmean = cvmean + cvtemp
  cvmat[,i] = sqrt(cvtemp/n)
}
```
```{r  include=FALSE, eval=FALSE}
cvmean = cvmean/ndocv
cvmean = sqrt(cvmean/n)
plot(kv,cvmean,type="n",ylim=range(cvmat),xlab="k",cex.lab=1.5)
for(i in 1:ndocv) lines(kv,cvmat[,i],col=i,lty=3) #plot each result
lines(kv,cvmean,type="b",col="black",lwd=5) #plot average result
```


```{r  include=FALSE, eval=FALSE}
#refit using all the data and k=41
ddf = data.frame(y,xs)
near41 = kknn(y~.,ddf,ddf,k=41,kernel = "rectangular")
lmf = lm(y~.,ddf)
fmat = cbind(y,near41$fitted,lmf$fitted)
colnames(fmat)=c("y","kNN41","linear")
pairs(fmat)
print(cor(fmat))
```



## Neural Net

We used the `h2o` package for running our neural nets. It was very different from the other packages we have worked with (as mentioned in class) and took some time to get used to. 

The following table was created by running iterations of the neural network model with different parameters. In attempts 1-10, the RMSE (in bushels per acre) was calculated by comparing predictions on the validation set against its actual yields. The final attempt took the best settings (attempt 9), refit on the second training set (the first training set plus the validation set), and computed the RMSE on the test data. The greatest differences in RMSE occurred with changes to the hidden layers, with the best values coming from the 400x400x400 configuration. Although attempt 10 used more layers, it appears to overfit the data and therefore produced a higher RMSE.

|Attempt Number | Hidden Layers | Epochs | Activation | L1 | RMSE |
|---------------|---------------|--------|------------|----|------|
| 1 |10 |1000 |RectifierWithDropout |1.00E-02 |9.609 |
| 2 |10x10 |500 |RectifierWithDropout |1.00E-03 |10.185 |
| 3 |150x150 |300 |RectifierWithDropout |1.00E-04 |8.74 |
| 4 |200x200 |500 |RectifierWithDropout |1.00E-04 |8.567 |
| 5 |100x100x100 |1000 |RectifierWithDropout |1.00E-04 |8.545 |
| 6 |250x250 |1000 |RectifierWithDropout |1.00E-05 |8.437 |
| 7 |400x400 |1000 |RectifierWithDropout |1.00E-05 |8.41 |
| 8 |50x50x50x50 |1000 |RectifierWithDropout |1.00E-04 |10.06 |
| 9 |400x400x400 |1000 |RectifierWithDropout |1.00E-05 |8.07 |
| 10 |500x500x500x500 |1000 |RectifierWithDropout |1.00E-05 |8.14 |
| final |400x400x400 |1000 |RectifierWithDropout |1.00E-05 |8.115 |
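The validation RMSEs above can also be pulled straight from `h2o` rather than computed by hand. A sketch, assuming a fitted model and validation frame named as in the chunks below:

```{r eval=FALSE}
# assumes modelSoy9 and val_h2o exist as in the modeling chunks
perf <- h2o.performance(modelSoy9, newdata = val_h2o)
h2o.rmse(perf)  # validation RMSE in bushels per acre
```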



```{r  include=FALSE, eval=FALSE}
library(h2o)
source("mlfuns.R")
sdat.all <- read.csv("SoyData2.csv", header = TRUE)
```

```{r  include=FALSE, eval=FALSE}
# pull out only the variables that have information ie `repno` is the rep number and is useless as a variable 
sdat=sdat.all[,c("YIELD","AREA","IRRIGATION","CEC","PH","ORGANIC.MATTER","CLAY","SILT_TOP","SAND_TOP","AWC_100CM","TEMP","PREC","RAD")] 

# create training, validation, and test data
set.seed(14) # my birthday
n=nrow(sdat)  # find number of obs in data

n1=floor(n/2)  # find half of number of obs (rounded) in data (for training data)
n2=floor(n/4)  # find fourth of number of obs (rounded) in data (for validation data)
n3=n-n1-n2   # find fourth (ie 1 - 1/2 - 1/4 rounded ) of number of obs (rounded) in data (for test data)

ii = sample(1:n,n) # randomly reorders n elements
# Divide data into training set, validation set, a combined training + validation set, and the final testing set
trainDf = sdat[ii[1:n1],]
valDf = sdat[ii[(n1+1):(n1+n2)],]
train2Df = rbind(trainDf,valDf)
testDf = sdat[ii[(n1+n2+1):n],]

```

```{r  include=FALSE, eval=FALSE}
h2oServer <- h2o.init(ip="localhost", port=54321,
max_mem_size="4g", nthreads=-1)

#create h2o dataframes
train_h2o = as.h2o(trainDf, destination_frame = "soy_train")
val_h2o = as.h2o(valDf, destination_frame = "soy_val")
train2_h2o = as.h2o(train2Df, destination_frame = "soy_train2")
test_h2o = as.h2o(testDf, destination_frame = "soy_test")

```

```{r  include=FALSE, eval=FALSE}
if(file.exists(file.path("./", "modelSoy1"))) {
modelSoy1 = h2o.loadModel(path = file.path("./", "modelSoy1"))
} else {
modelSoy1 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=10,
epochs=1000,
export_weights_and_biases=T,
l1 = 1e-2,
model_id = "modelSoy1"
)
h2o.saveModel(modelSoy1, path="./")
}
phat = predict(modelSoy1, val_h2o)
phatL = list()
phatL$modelSoy1 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
### fit h2o deep
if (file.exists(file.path("./", "modelSoy2"))) {
modelSoy2 = h2o.loadModel(path = file.path("./", "modelSoy2"))
} else {
modelSoy2 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(10,10),
epochs=500,
activation="RectifierWithDropout",
l1=1e-3,
export_weights_and_biases=TRUE,
model_id = "modelSoy2"
)
h2o.saveModel(modelSoy2, path="./")
}
phat = predict(modelSoy2, val_h2o)
phatL$modelSoy2 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy3"))) {
modelSoy3 = h2o.loadModel(path = file.path("./", "modelSoy3"))
} else {
modelSoy3 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(150,150),
epochs=300,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy3"
)
h2o.saveModel(modelSoy3, path="./")
}
phat = predict(modelSoy3, val_h2o)
phatL$modelSoy3 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy4"))) {
modelSoy4 = h2o.loadModel(path = file.path("./", "modelSoy4"))
} else {
modelSoy4 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(200,200),
epochs=500,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy4"
)
h2o.saveModel(modelSoy4, path="./")
}
phat = predict(modelSoy4, val_h2o)
phatL$modelSoy4 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy5"))) {
modelSoy5 = h2o.loadModel(path = file.path("./", "modelSoy5"))
} else {
modelSoy5 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(100,100,100),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy5"
)
h2o.saveModel(modelSoy5, path="./")
}
phat = predict(modelSoy5, val_h2o)
phatL$modelSoy5 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy6"))) {
modelSoy6 = h2o.loadModel(path = file.path("./", "modelSoy6"))
} else {
modelSoy6 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(250,250),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy6"
)
h2o.saveModel(modelSoy6, path="./")
}
phat = predict(modelSoy6, val_h2o)
phatL$modelSoy6 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy7"))) {
modelSoy7 = h2o.loadModel(path = file.path("./", "modelSoy7"))
} else {
modelSoy7 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(400,400),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy7"
)
h2o.saveModel(modelSoy7, path="./")
}
phat = predict(modelSoy7, val_h2o)
phatL$modelSoy7 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy8"))) {
modelSoy8 = h2o.loadModel(path = file.path("./", "modelSoy8"))
} else {
modelSoy8 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(50,50,50,50),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy8"
)
h2o.saveModel(modelSoy8, path="./")
}
phat = predict(modelSoy8, val_h2o)
phatL$modelSoy8 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy9"))) {
modelSoy9 = h2o.loadModel(path = file.path("./", "modelSoy9"))
} else {
modelSoy9 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(400,400,400),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy9"
)
h2o.saveModel(modelSoy9, path="./")
}
phat = predict(modelSoy9, val_h2o)
phatL$modelSoy9 = as.matrix( phat[,1] )
```
```{r  include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy10"))) {
modelSoy10 = h2o.loadModel(path = file.path("./", "modelSoy10"))
} else {
modelSoy10 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(500,500,500,500),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy10"
)
h2o.saveModel(modelSoy10, path="./")
}
phat = predict(modelSoy10, val_h2o)
phatL$modelSoy10 = as.matrix( phat[,1] )
```

```{r  include=FALSE, eval=FALSE}

# validation RMSE for each of the ten models
for (m in paste0("modelSoy", 1:10)) {
  rmse = sqrt(mean((valDf[,1] - phatL[[m]])^2))
  cat(m, "validation rmse:", rmse, "\n")
}
```


```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelFinal"))) {
modelFinal = h2o.loadModel(path = file.path("./", "modelFinal"))
} else {
modelFinal = h2o.deeplearning(
x=2:12, y=1,
training_frame=train2_h2o,
hidden=c(400,400,400),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelFinal"
)
h2o.saveModel(modelFinal, path="./")
}
phat = predict(modelFinal, test_h2o)
phatL$modelFinal = as.matrix( phat[,1] )

rmse = sqrt(mean((testDf[,1]-phatL$modelFinal)^2))
print(rmse)
```


## Boosting: Future Work
We were curious how our three methods would compare to boosting with trees, so we used the `gbm` package with `distribution` set to gaussian, `interaction.depth` set to 4, `n.trees` set to 5000, and `shrinkage` set to 0.01. We were surprised to see that our boosting model (with almost no tuning) was very competitive, with an RMSE of $7.9159$ bushels per acre. This is a model we would like to tune in future work, as it shows promise.

```{r include=FALSE, eval=FALSE}
# fit a tree using boosting on training data
boost.fit = gbm(YIELD~.,data=sdat.train,distribution="gaussian",interaction.depth=4,n.trees=5000,shrinkage=0.01)

boostvalpred=predict(boost.fit,newdata=sdat.val,n.trees=5000)  # predict using model `boost.fit` on validation data
rmse = sqrt(mean((sdat.val$YIELD-boostvalpred)^2)) # find RMSE
cat("rmse on validation for boosting: ",rmse,"\n")

```


```{r include=FALSE, eval=FALSE}

#plot (out-of-sample) fits
pairs(cbind(sdat.val$YIELD,rfvalpred,boostvalpred))
print(cor(cbind(sdat.val$YIELD,rfvalpred,boostvalpred)))


#plot validation y vs validation predictions
plot(sdat.val$YIELD,boostvalpred)
abline(0,1,col="red",lwd=2)

#variable importance from boosting
summary(boost.fit)


rmse = sqrt(mean((sdat.val$YIELD-rfvalpred)^2))
cat("rmse on validation for random forests: ",rmse,"\n")
#--------------------------------------------------
#variable importance from Random Forests
varImpPlot(rf.fit)
importance(rf.fit)



```


# Conclusion

Neural Nets, KNN, and Random Forests were all fairly close in performance, and all better than Linear Regression. Our best OOB performance was from our Random Forests model with an OOB RMSE of $7.9042$ bushels per acre. Boosting with trees showed promise and should be considered in future work. 




