---
title: "Analyzing Indonesian rice farms"
author: "Jonas Kernebeck, Alexander Flick, Felix Lehner"
date: "01/11/2022"
output:
  pdf_document:
    toc: yes
    toc_depth: '2'
  html_document:
    toc: yes
    toc_depth: 2
  word_document:
    toc: yes
    toc_depth: '2'
---
<!--
packages
install.packages("plm")
install.packages("splm")
install.packages("GGally")
install.packages("heatmaply")
install.packages("tidyverse")
install.packages("corrr")
install.packages("devtools")
devtools::install_github("hadley/productplots")
-->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
\newpage
```{r imported libraries}
library(plm)
library(GGally)
library(heatmaply)
library(tidyverse)
library(tidyr)
library(ggmosaic)
library(gridExtra)
library(gam)
library(patchwork)
library(ggcorrplot)
library(kableExtra)
library(formattable)
library(ggplot2)
library(productplots)
library(glmnet)
library(leaps)
```
```{r constants}
data("RiceFarms", package = "plm")
col_to_remove = !names(RiceFarms) %in% c("id", "noutput")
RiceFarms = RiceFarms[,col_to_remove]
```
```{r data}
# some code here
```
# 1 Introduction
The present data set contains production data for 171 Indonesian rice farms and consists of 1026 observations in total. The data frame contains the following variables:
| variable | description | expressions |
|:----------:|:-----------------------------------------------:|:-------------------------:|
| id | unique identifier for a farm | unique id |
| size | total production area in hectares | 0.01 - 5.322 |
| status | status of property rights | "owner", "share", "mixed" |
| varieties | rice seed varieties | "trad", "high", "mixed" |
| bimas | bimas-status of the farmers | "no", "yes", "mixed" |
| seed | seed in kilogram | 1 - 1250 kg |
| urea | urea in kilogram | 1 - 1250 kg |
| phosphate | phosphate in kilogram | 0 - 700 kg |
| pesticide | pesticide cost in Rupiah | 0 - 62600 r |
| pseed | price of seed in Rupiah per kg | 40 - 375 r/kg |
| purea | price of urea in Rupiah per kg | 50 - 100 r/kg |
| pphosph | price of phosphate in Rupiah per kg | 60 - 120 r/kg |
| hiredlabor | hired labor in hours | 1 - 4536 h |
| famlabor | family labor in hours | 1 - 1526 h |
| totlabor | total labor (excluding harvest labor) | 1 - 4774 h |
| wage | labor wage in Rupiah per hour | 30 - 175.35 r/h |
| goutput | gross output of rice in kg | 42 - 20960 kg |
| noutput | gross output minus harvesting cost | 42 - 17610 kg |
| price | price of rough rice in Rupiah per kg | 50 - 190 r/kg |
| region | region of the farm | unique region |
As shown in the table, the data set consists of 16 numeric variables and 4 categorical variables. The target variable for the regression modeling will be *goutput*, which represents the gross output of rice in *kg* for the respective rice farm.
In the following, some explorative data analysis is carried out to get a first impression of the distribution of the individual variables.
## 1.1 Numerical Variables
The following figure shows boxplots for the materials used and the prices paid for them by the respective rice farms. The boxplots for the materials show that the distribution of all materials is right-skewed. The spread of seed is the lowest, followed by phosphate and urea. Accordingly, *urea* also has the highest variance with 16166, followed by *phosphate* with 2264 and *seed* with 2048. The distribution of *urea* indicates that rice farms in Indonesia may use urea very differently, caused e.g. by the bimas status. The bimas program is a rice intensification program by the government to support local rice production by providing high-yield rice seeds as well as technical assistance.
If we look at the prices for phosphate (*pphosph*) and urea (*purea*), we can see slightly left-skewed distributions with low variance (75 for *purea* and 86 for *pphosph*). In contrast, the prices for seeds scatter much more. The distribution of *pseed* is strongly right-skewed, as is the distribution of the rice price *price*. The rice price also scatters, but less than *pseed*. The two prices have a correlation of 0.67. Naturally, the price of seeds affects the selling price of rice; the prices may fluctuate due to seasonal or regional factors and influence each other.
The distribution of labor hours is also slightly skewed to the right. Overall, the dispersion is lowest for *famlabor*. *hiredlabor* and *totlabor* have a similar spread, but *totlabor* is at a higher level overall, since *hiredlabor* is a subset of *totlabor*.
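The reported correlation of 0.67 between *pseed* and *price* can be checked with a one-liner such as the following (assuming a plain Pearson correlation on the raw values):
```{r eval=FALSE}
# Pearson correlation between seed price and rice price
cor(RiceFarms$pseed, RiceFarms$price)
```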
\newpage
```{r, fig.width=10, fig.height=9}
df_boxplot_materials = data.frame(
feature=c(rep("seed", 1026), rep("urea", 1026),rep("phosphate", 1026)),
material=c(RiceFarms$seed, RiceFarms$urea, RiceFarms$phosphate))
bpl_materials = ggplot(df_boxplot_materials, aes(x=feature, y=material, fill=feature)) +
geom_boxplot() +
ylim(0, 600) +
ggtitle('Distribution of materials') +
theme(plot.title = element_text(hjust = 0.5))
df_boxplot_price_materials = data.frame(
feature=c(rep("price seed", 1026), rep("price urea", 1026),rep("price phosphate", 1026), rep("price rice", 1026)),
rupiah_per_kg=c(RiceFarms$pseed, RiceFarms$purea, RiceFarms$pphosph, RiceFarms$price))
bpl_price_materials = ggplot(df_boxplot_price_materials, aes(x=feature, y=rupiah_per_kg, fill=feature)) +
geom_boxplot() +
ggtitle('Distribution of material prices') +
theme(plot.title = element_text(hjust = 0.5))
# + ylim(0, 400)
df_boxplot_labor = data.frame(
labor=c(rep("hired labor", 1026), rep("family labor", 1026),rep("total labor", 1026)),
hours=c(RiceFarms$hiredlabor, RiceFarms$famlabor, RiceFarms$totlabor))
bpl_labor = ggplot(df_boxplot_labor, aes(x=labor, y=hours, fill=labor)) +
geom_boxplot() +
ggtitle('Distribution of labor hours') +
theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,500)
grid.arrange(bpl_materials, bpl_price_materials, bpl_labor, ncol=1, nrow =3)
```
\newpage
## 1.2 Categorical Variables
The following mosaic plot shows the distribution of the categorical variables *varieties*, *region* and *bimas*. Overall, all regions are roughly equally represented in the data set. We can see that most of the farmers with the bimas status *yes* or *mixed* are located in the region *ciwangi*. The distribution of the different varieties is strongly dependent on the region. While the *high* varieties have the biggest share in the regions *wargabinangun* and *langan*, the *traditional* varieties dominate the regions *gunungwangi*, *malausma* and *ciwangi*. The *mixed* varieties are only used to a small extent in all regions.
\
\
\
```{r, fig.width=7, fig.height=5}
df_cat = RiceFarms[,c("region", "varieties", "bimas")]
test = ggplot(df_cat) +
ggtitle("Mosaic Plot for varieties, region and bimas") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_mosaic(aes(x = product(varieties, region), fill=bimas, offset = 0.2))
breaks = ggplot_build(test)$layout$panel_params[[1]]$x$get_breaks()
labels=c("wargabinangun","", "","","langan", "","gunungwangi","", "",
"malausma","", "","sukaambit","", "",
"ciwangi","","")
ggplot(df_cat) +
ggtitle("Mosaic Plot for varieties, region and bimas") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_mosaic(aes(x = product(varieties, region), fill=bimas, offset = 0.2)) +
scale_x_productlist(breaks=breaks, labels=labels) +
xlab("region")
```
\
\
\
To test whether the categorical variables have an impact on our target variable *goutput*, one-way ANOVAs as well as ANOVAs including interaction terms are performed. The results are summarized in the following table:
\
\
| formula | F-value | p-value | significant |
|:----------------------:|:-------:|:--------:|:-----------:|
| region | 22.981 | < 2e-16 | yes |
| varieties | 11.764 | 8.94e-06 | yes |
| bimas | 14.817 | 4.57e-07 | yes |
| region+varieties | 3.847 | 3.96e-05 | yes |
| region+bimas | 5.651 | 2.94e-08 | yes |
| varieties+bimas | 0.791 | 0.531 | no |
| region+varieties+bimas | 0.860 | 0.580 | no |
The ANOVA outputs show that all of the categorical variables have a significant main effect on *goutput*; the null hypothesis that the mean of *goutput* is the same across the groups is rejected. The interaction terms of *region* with *varieties* and of *region* with *bimas* are also significant, whereas the interaction of *varieties* and *bimas* and the three-way interaction of all three variables are not.
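A minimal sketch of how such ANOVAs could be run with base R (the exact formulas shown here are an assumption, not necessarily the original analysis):
```{r eval=FALSE}
# One-way ANOVA for a single factor ...
summary(aov(goutput ~ region, data = RiceFarms))
# ... and a two-way ANOVA including the interaction term
summary(aov(goutput ~ region * varieties, data = RiceFarms))
```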
## 1.3 Variable selection and transformation
The performance of the regression modeling is highly dependent on the variable selection and transformation, so a suitable choice is very important.
The variable *noutput* is a linear transformation of *goutput*, as it represents *goutput* reduced by the harvesting costs. It is therefore not used for the modeling, because it would introduce severe multicollinearity. \
The variable *size* also correlates *strongly* with the target variable. This can be intuitively explained by the fact that a larger rice field naturally produces a higher yield. Since the variables *seed*, *urea*, *phosphate* and *pesticide* are dependent on size, they are transformed into per-hectare quantities by dividing them by the respective farm size in hectares. \
The variables *famlabor* and *hiredlabor* are subsets of the variable *totlabor* and are therefore transformed into shares of *totlabor* by dividing them by *totlabor*. The variable *totlabor* is then itself transformed into a per-hectare quantity by dividing it by *size*.
The variable *wage* follows a bimodal distribution. It is therefore transformed into a binary variable that indicates whether the wage is above or below 100 Rupiah per hour.
## 1.4 Model evaluation
The data set is split into 60-20-20 parts, where 60% of the data is used for training the model and 20% each for validation and testing. Cross-validation is also used in the modeling part. Different metrics are used to evaluate and compare the models: the *MSE* (mean squared error) and the *AIC* (Akaike information criterion). Besides these numeric metrics, graphical diagnostics such as residual plots are used for evaluation.
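For reference, the MSE on a data split with $n$ observations is computed on the original *goutput* scale as
$$
\mathrm{MSE}=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}
$$
where $y_{i}$ is the observed gross output and $\hat{y}_{i}$ the model prediction (back-transformed with the exponential function when a model is fit on the log scale).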
```{r}
#reorder levels
factor_bimas <- c("no", "mixed", "yes")
RiceFarms$bimas <- factor(RiceFarms$bimas, levels = factor_bimas)
factor_varieties <- c("trad", "mixed", "high")
RiceFarms$varieties <- factor(RiceFarms$varieties, levels = factor_varieties)
```
Three outliers, most likely due to typos, have been identified in observations 110, 947 and 1004 and were manually corrected.
```{r echo=FALSE}
#outliers
#RiceFarms[RiceFarms$seed>1000,]
#RiceFarms[RiceFarms$id==102220,]
RiceFarms$seed[110] <- 125 #instead of 1250
#RiceFarms[RiceFarms$pesticide>60000,]
#RiceFarms[RiceFarms$id==607168,]
RiceFarms$pesticide[947] <- 6260 #instead of 62600
#outlier?
#RiceFarms[RiceFarms$totlab_size>14000,]
#RiceFarms[RiceFarms$id==609241,]
RiceFarms$size[1004] <- 0.1 #instead of 0.01
```
```{r}
#divide by size
RiceFarms$seed_size <- RiceFarms$seed/RiceFarms$size
RiceFarms$phosph_size <- RiceFarms$phosphate/RiceFarms$size
RiceFarms$urea_size <- RiceFarms$urea/RiceFarms$size
RiceFarms$totlab_size <- RiceFarms$totlabor/RiceFarms$size
RiceFarms$pest_size <- RiceFarms$pesticide/RiceFarms$size
```
```{r}
RiceFarms$fam_ratio <- RiceFarms$famlabor/RiceFarms$totlabor
```
```{r}
#wage
#plot(wage,goutput)
RiceFarms$wage_cat <- ifelse(RiceFarms$wage > 100, ">100", "<=100") # label wages above vs. at most 100 r/h
RiceFarms$wage_cat <- factor(RiceFarms$wage_cat)
```
```{r}
#Split the data into a 60-20-20 split for training, validation and testing.
set.seed(20211207)
n <- nrow(RiceFarms)
train <- sample(1:n,0.8*n)
test <- setdiff(1:n,train)
set.seed(20211207)
val <- sample(train,length(test))
train <- setdiff(train,val)
```
```{r}
MSE <- function(model, data, split=train){
  # MSE on the original goutput scale; assumes the model predicts log(goutput)
  pred.split <- predict(model, newdata = data[split,])
  return(mean((exp(pred.split)-RiceFarms$goutput[split])^2))
}
```
\newpage
# 2 First Model
The first model to present and evaluate is called Lasso Regression.
## 2.1 Lasso Regression
As with most regression approaches, it minimizes the residual sum of squares (RSS), but in addition a penalty term is included in the objective which shrinks some of the coefficient estimates towards, and possibly exactly to, zero.
$$
R S S+\lambda \sum_{j=1}^{p}\left|\widehat{\beta}_{j}\right|
$$
where $\lambda$ is a hyperparameter, $p$ is the number of predictors and $\widehat{\beta}_{j}$ are the coefficient estimates. The sum starts at $j=1$ because the intercept $\beta_{0}$ is not a predictor coefficient and is therefore not penalized. The penalty sums the absolute values of the coefficient estimates. As can be seen from the equation, the lasso regression reduces to ordinary least squares regression if $\lambda$ is zero. \newline
The goal of the lasso regression is to create a sparse model, which makes it easier to interpret.
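As a small illustration of this sparsity (a sketch with arbitrarily chosen predictors and penalty values, separate from the model selection carried out later), the number of non-zero coefficients can be compared for a small and a large $\lambda$:
```{r eval=FALSE}
library(glmnet)
# Sketch: the same design matrix fitted with a small and a large penalty;
# the larger lambda forces more coefficients to be exactly zero
x_demo <- model.matrix(goutput ~ size + seed + urea + phosphate + totlabor, RiceFarms)[, -1]
y_demo <- log(RiceFarms$goutput)
fit_small <- glmnet(x_demo, y_demo, alpha = 1, lambda = 0.001)
fit_large <- glmnet(x_demo, y_demo, alpha = 1, lambda = 0.5)
sum(as.vector(coef(fit_small)) != 0)  # most coefficients remain non-zero
sum(as.vector(coef(fit_large)) != 0)  # far fewer coefficients survive the larger penalty
```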
## 2.2 Feature Selection
Before starting with the actual regression, we can investigate which of the features could be important for predicting *goutput*. This step is helpful, as the number of features in the data set is over twenty. \newline
The method used for selecting the variables is forward selection. The leaps package provides the function regsubsets for linear models for this purpose. It takes the target variable and the features of the data set as input and indicates which variables are likely to predict *goutput* best. The underlying algorithm optimizes Mallows' $C_p$ statistic, which is related to the AIC (James, p. 79). \newline
The function offers further input parameters, e.g. the maximum subset size, a weight vector, the number of best subsets to report and the method of variable selection (forward selection, backward selection, etc.); an illustrative call is sketched below. \newline
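A minimal sketch of such a call (the arguments shown, e.g. `nvmax`, `nbest` and `method`, are illustrative choices rather than the settings used below):
```{r eval=FALSE}
library(leaps)
# Illustrative call: best subsets of up to 10 variables via forward selection,
# keeping only the single best subset of each size (nbest = 1)
fit <- regsubsets(goutput ~ ., data = RiceFarms[train, ],
                  nvmax = 10, nbest = 1, method = "forward")
summary(fit)
```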
The following graphics show the results of the analysis.
```{r}
feature_selection<-function(data){
regfit.full<-regsubsets(goutput~.,data,nvmax=15)
reg.sum<<-summary(regfit.full)
dat<-data.frame(rss=reg.sum$rss, adjr2=reg.sum$adjr2,cp=reg.sum$cp,bic=reg.sum$bic)
rss_plot<- ggplot(dat)+
aes(x=seq(1:length(rss)),y=rss)+
geom_line()+
xlab('Number of Variables')+
ylab('RSS')
adj_r2_max = which.max(reg.sum$adjr2) # 11
adj_r_plot<-ggplot(dat)+
aes(x=seq(1:length(adjr2)),y=adjr2)+
geom_line()+
geom_point(aes(x=adj_r2_max,y=adjr2[adj_r2_max]))+
xlab('Number of Variables')+
ylab('Adjusted RSq')
cp_min = which.min(dat$cp)
cp_plot<-ggplot(dat)+
aes(x=seq(1:length(cp)),y=cp)+
geom_line()+
geom_point(aes(x=cp_min,y=cp[cp_min]))+
xlab('Number of Variables')+
ylab('Cp')
bic_min = which.min(reg.sum$bic)
bic_plot<-ggplot(dat)+
aes(x=seq(1:length(bic)),y=bic)+
geom_line()+
geom_point(aes(x=bic_min,y=bic[bic_min]))+
xlab('Number of Variables')+
ylab('BIC')
patch<-(rss_plot + adj_r_plot)/(cp_plot + bic_plot)
patch + plot_annotation(title='Feature Selection')
}
####3.
feature_selection(RiceFarms[train,])
```
**Explanation:**
The graphic shows the RSS, the adjusted $R^2$, the $C_p$ statistic and the BIC of the models. \newline
The metrics help to identify the overall best models for the problem. Each of them suggests that a 3-variable model could already be sufficient, as the metrics only improve slightly when more variables are added. Only for the BIC can it be seen that the metric gets worse beyond 7 variables.\newline
Investigating the output of regsubsets shows which variables are selected to give the best results.
```{r}
#reg.sum
```
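Since the summary object is stored globally as `reg.sum` inside `feature_selection()`, the variables of, for example, the best 3-variable subset can be listed with a short snippet like the following (a sketch; `which` is the logical inclusion matrix of the `summary.regsubsets` object):
```{r eval=FALSE}
# Names of the (dummy-encoded) variables included in the best 3-variable model
names(which(reg.sum$which[3, ]))
```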
The best 3-variable model selects size, phosphate and totlabor as the best performing variables. These variables are also contained in every larger model, so they are kept in mind when searching for an optimal model for the lasso regression. Some other variables such as pesticide, varieties, wage and region are also likely to have an influence on the model. \newline
## 2.3 Preprocessing
Since a specific package, glmnet, is used for the lasso regression, the data needs to be preprocessed. One important step is to transform the qualitative variables, i.e. factor variables, into dummy variables so the model can use them. The method is relatively simple: extra features with binary values are created. model.matrix does this transformation automatically by creating a design matrix out of the data frame. It requires a formula specifying which variables to include, which also offers the opportunity to select the desired variables and to apply mathematical transformations such as logarithms or polynomial terms before the model matrix is built.\newline
In addition, a scaling option is implemented to test whether the model improves when the data is scaled.
```{r}
prepare_x<-function(expr,scaler=FALSE,center=FALSE){
if(scaler){
x<-scale(model.matrix(expr,RiceFarms)[,-1],center=center) #creates a matrix and transforms qualitative variables to dummy variables
}
else{
x<-model.matrix(expr,RiceFarms)[,-1] #creates a matrix and transforms qualitative variables to dummy variables
}
return(x)
}
x<-prepare_x(goutput~.)
y<-log(RiceFarms$goutput)
```
## 2.4 Training and Evaluation
Once the input features are transformed properly, the lasso regression can be trained. After training, the function returns a whole sequence of models, one for each value of the hyperparameter $\lambda$. This allows us to fine-tune the model and improve its performance. To do this, a simple 10-fold cross-validation on the validation data is applied, which is also included in the glmnet package. Before starting the validation, a grid of possible lambdas is prepared so that the range of the $\lambda$ parameter can easily be changed. Afterwards, a plot can be obtained showing the mean cross-validated error depending on $\lambda$. The dotted lines in this plot mark the $\lambda$ that minimizes the mean cross-validated error and the $\lambda$ that gives the most regularized model whose cross-validated error is still within one standard error of the minimum. In this task the $\lambda$ with the minimum mean cross-validated error is chosen. \newline
To evaluate the selected model on the validation data with the chosen $\lambda$, the metrics MSE, BIC, AIC, AICc (a modification of the AIC that corrects for small sample sizes) and $R^2$ are used. \newline
Since the glmnet package does not provide ready-made functions for these metrics, BIC, AIC, AICc and $R^2$ are implemented manually. The formulas used are:
$$
\mathrm{BIC}=-\chi^{2}+k \ln (n)
$$
with
$$
\chi^{2}=\text { Null deviance }-\text { Residual deviance }
$$
and k as the number of parameters estimated by the model and n the number of observations.
$$
\mathrm{AIC}=-\chi^{2}+ 2k
$$
$$
\mathrm{AICc}= AIC+\frac{2k(k+1)}{n-k-1}
$$
and
$$
\begin{aligned}
R^{2} &=1-\frac{\text { sum of squared errors (SSE) }}{\text { total sum of squares (SST) }} \\
&=1-\frac{\sum\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum\left(y_{i}-\bar{y}\right)^{2}}
\end{aligned}
$$
where $y$ are the actual values, $\hat{y}$ the predicted values and $\bar{y}$ the mean value of the $y$.
```{r}
BICAIC<-function(fit){
tLL <- fit$nulldev - deviance(fit)
k <- fit$df
n <- fit$nobs
AICc <- -tLL+2*k+2*k*(k+1)/(n-k-1)
AIC_ <- -tLL+2*k
BIC<-log(n)*k - tLL
return(list('AIC'=AIC_,'BIC'=BIC,'AICc'=AICc))
}
```
```{r}
prepare_grid<-function(grid_param=c(0,-4,100)){
grid <- 10^ seq(grid_param[1],grid_param[2], length =grid_param[3])
return(list('grid'=grid))
}
train_lasso_with_feature<-function(x,y,grid){
#Fit train and predict on test set to get the MSE
lasso.mod <-glmnet(x[train,],y[train],alpha=1, lambda =grid , thresh =1e-12)
lasso.pred<-predict(lasso.mod ,s=1, newx=x[val,])#s is the lambda because we have a grid of lasso.models
#cat('----MSE----:',mean((lasso.pred -y[val])^2))#MSE
return(lasso.mod)
}
train_lasso_cv<-function(lasso.mod,x,y,grid){
##Use Cross-Validation to find the best lambda
set.seed(1)
cv.out <-cv.glmnet(x[val,],y[val],lambda=grid,alpha =1) #does 10-fold-CV as default
#plot(cv.out)
bestlam <-cv.out$lambda.min
best.model<-glmnet(x[train,],y[train],alpha=1, lambda =bestlam , thresh =1e-12)
res<-BICAIC(best.model)
#Now what is the test MSE with this best lambda
lasso.pred.train<-predict(lasso.mod ,s=bestlam ,newx=x[train,])
lasso.pred<-predict(lasso.mod ,s=bestlam ,newx=x[val,])
mse.train<-mean((lasso.pred.train -y[train])^2)
mse.val<-mean((lasso.pred -y[val])^2)
rss<-(lasso.pred-y[val])^2
#plot(rss)
sst <- sum((y[val] - mean(y[val]))^2)
sse <- sum((lasso.pred - y[val])^2)
#find R-Squared
rsq <- 1 - sse/sst
return(list('bestlam'=bestlam,'BIC'=res,'RSS'=rss,'RSq'=rsq,'FITTED'=lasso.pred,'df'=best.model$df,'mse.val'=mse.val,'mse.train'=mse.train))
}
```
## 2.5 Model Selection
After setting up the environment and the criteria for training and evaluating models, the best model can be selected from the possible feature subsets. As indicated at the beginning of the feature selection section, the variables size, totlabor and phosphate are used as a starting point. This is done sequentially in order to obtain the metrics of each model, and the preselection is then continued with the other variables mentioned. Additional mathematical transformations of the variables are also considered and included. In order to provide as much variation as possible, the variables are also compared with their size-scaled counterparts. The results of the evaluation are stored in a table and assessed.
```{r}
prep<-prepare_grid()
x<-prepare_x(goutput~log(size)+log(seed))
lasso.model.1<-train_lasso_with_feature(x,y,prep$grid)
fitted.1<-train_lasso_cv(lasso.model.1,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+seed)
lasso.model.2<-train_lasso_with_feature(x,y,prep$grid)
fitted.2<-train_lasso_cv(lasso.model.2,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+seed_size)
lasso.model.3<-train_lasso_with_feature(x,y,prep$grid)
fitted.3<-train_lasso_cv(lasso.model.3,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size))
lasso.model.4<-train_lasso_with_feature(x,y,prep$grid)
fitted.4<-train_lasso_cv(lasso.model.4,x,y,prep$grid)
####log(seed_size) better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+log(totlabor))
lasso.model.5<-train_lasso_with_feature(x,y,prep$grid)
fitted.5<-train_lasso_cv(lasso.model.5,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlabor)
lasso.model.6<-train_lasso_with_feature(x,y,prep$grid)
fitted.6<-train_lasso_cv(lasso.model.6,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+log(totlab_size))
lasso.model.7<-train_lasso_with_feature(x,y,prep$grid)
fitted.7<-train_lasso_cv(lasso.model.7,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size)
lasso.model.8<-train_lasso_with_feature(x,y,prep$grid)
fitted.8<-train_lasso_cv(lasso.model.8,x,y,prep$grid)
###normal totlab_size better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+urea)
lasso.model.9<-train_lasso_with_feature(x,y,prep$grid)
fitted.9<-train_lasso_cv(lasso.model.9,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea))
lasso.model.10<-train_lasso_with_feature(x,y,prep$grid)
fitted.10<-train_lasso_cv(lasso.model.10,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+urea_size)
lasso.model.11<-train_lasso_with_feature(x,y,prep$grid)
fitted.11<-train_lasso_cv(lasso.model.11,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea_size))
lasso.model.12<-train_lasso_with_feature(x,y,prep$grid)
fitted.12<-train_lasso_cv(lasso.model.12,x,y,prep$grid)
####log(urea) better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat)
lasso.model.13<-train_lasso_with_feature(x,y,prep$grid)
fitted.13<-train_lasso_cv(lasso.model.13,x,y,prep$grid)
####better than before
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+phosphate)
lasso.model.14<-train_lasso_with_feature(x,y,prep$grid)
fitted.14<-train_lasso_cv(lasso.model.14,x,y,prep$grid)
####worse than before
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed)+log(totlabor)+log(urea)+log(phosphate+1))
lasso.model.15<-train_lasso_with_feature(x,y,prep$grid)
fitted.15<-train_lasso_cv(lasso.model.15,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed)+log(totlabor)+log(urea)+phosph_size)
lasso.model.16<-train_lasso_with_feature(x,y,prep$grid)
fitted.16<-train_lasso_cv(lasso.model.16,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed)+log(totlabor)+log(urea)+log(phosph_size+1))
lasso.model.17<-train_lasso_with_feature(x,y,prep$grid)
fitted.17<-train_lasso_cv(lasso.model.17,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region)
lasso.model.18<-train_lasso_with_feature(x,y,prep$grid)
fitted.18<-train_lasso_cv(lasso.model.18,x,y,prep$grid)
####better than before
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pesticide)
lasso.model.19<-train_lasso_with_feature(x,y,prep$grid)
fitted.19<-train_lasso_cv(lasso.model.19,x,y,prep$grid)
```
```{r}
####better
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+log(pesticide+1))
lasso.model.20<-train_lasso_with_feature(x,y,prep$grid)
fitted.20<-train_lasso_cv(lasso.model.20,x,y,prep$grid)
####slightly better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size)
lasso.model.21<-train_lasso_with_feature(x,y,prep$grid)
fitted.21<-train_lasso_cv(lasso.model.21,x,y,prep$grid)
###better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+log(pest_size+1))
lasso.model.22<-train_lasso_with_feature(x,y,prep$grid)
fitted.22<-train_lasso_cv(lasso.model.22,x,y,prep$grid)
###worse
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+price)
lasso.model.23<-train_lasso_with_feature(x,y,prep$grid)
fitted.23<-train_lasso_cv(lasso.model.23,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price))
lasso.model.24<-train_lasso_with_feature(x,y,prep$grid)
fitted.24<-train_lasso_cv(lasso.model.24,x,y,prep$grid)
####slightly better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price)+fam_ratio)
lasso.model.25<-train_lasso_with_feature(x,y,prep$grid)
fitted.25<-train_lasso_cv(lasso.model.25,x,y,prep$grid)
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price)+I(sqrt(-log(fam_ratio))))
lasso.model.26<-train_lasso_with_feature(x,y,prep$grid)
fitted.26<-train_lasso_cv(lasso.model.26,x,y,prep$grid)
####worse
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price)+varieties)
lasso.model.27<-train_lasso_with_feature(x,y,prep$grid)
fitted.27<-train_lasso_cv(lasso.model.27,x,y,prep$grid)
####better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price)+varieties+bimas)
lasso.model.28<-train_lasso_with_feature(x,y,prep$grid)
fitted.28<-train_lasso_cv(lasso.model.28,x,y,prep$grid)
####worse
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price)+varieties+purea)
lasso.model.29<-train_lasso_with_feature(x,y,prep$grid)
fitted.29<-train_lasso_cv(lasso.model.29,x,y,prep$grid)
####better
```
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat+region+pest_size+log(price)+varieties+log(purea))
lasso.model.30<-train_lasso_with_feature(x,y,prep$grid)
fitted.30<-train_lasso_cv(lasso.model.30,x,y,prep$grid)
####better
```
```{r}
options(tinytex.verbose = TRUE)
lassos <- seq(1:30)
l <- length(lassos)
var <- rep(NA, l)
df <- rep(NA, l)
MSE.train <- rep(NA, l)
MSE.val <- rep(NA, l)
aic <- rep(NA, l)
aicc<-rep(NA, l)
bic<- rep(NA, l)
r_sq<- rep(NA, l)
for (i in 1:l){
fitted<-get(eval(paste0('fitted.',lassos[[i]],sep='')))
model.name<-get(eval(paste0('lasso.model.',lassos[[i]],sep='')))
var[i] <- dimnames(tail(coefficients(model.name,s=fitted$bestlam), n=1))[[1]]
df[i] <- fitted$df
MSE.train[i] <- round(fitted$mse.train,digits=4)
MSE.val[i] <- round(fitted$mse.val,digits=4)
aic[i] <- round(fitted$BIC$AIC,1)
aicc[i] <- round(fitted$BIC$AICc,1)
bic[i] <- round(fitted$BIC$BIC,1)
r_sq[i] <- round(fitted$RSq,digits=4)
}
model_perf <- data.frame(var, df, MSE.train, MSE.val, aic, aicc, bic, r_sq)
#colnames(model_perf) <- c("var","df","MSE.train","MSE.val","dev","aic","p_val_p", "p_val_np", "df_np")
#model_perf$var[17]<- "s(pphosph)+s(purea)"
#model_perf$var[22:23] <- c("bimas", "bimas+varieties", "bimas+status", "bimas+region")
#model_perf$var[25] <- "s(purea)+varieties"
kbl(model_perf, caption = "Lasso comparison for variable selection", longtable = TRUE,booktabs=TRUE) %>%
kable_styling(latex_options = c("hold_position", "repeat_header","striped"),stripe_index=c(5:8,13,18,23:24,27,29:30)) %>%
kable_paper("hover", full_width = FALSE) %>%
pack_rows("size+seed", 1, 4,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor", 5, 8,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea", 9, 12,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat", 13, 13,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+phosphate", 14, 17,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region", 18, 18,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region+pesticide", 19, 22,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region+pesticide+price", 23, 24,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region+pesticide+price+fam_ratio", 25, 26,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region+pesticide+price+varieties", 27, 27,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region+pesticide+price+varieties+bimas", 28, 28,latex_gap_space="1em") %>%
pack_rows("size+seed+totlabor+urea+wage_cat+region+pesticide+price+varieties+purea", 29, 30,latex_gap_space="1em")
```
\newpage
**Table columns:**
* var: the variable that is added
* df: degrees of freedom
* MSE.train: MSE on training data
* MSE.val: MSE on validation data
* aic: Akaike information criterion
* aicc: corrected Akaike information criterion (AIC with a small-sample penalty term)
* bic: Bayesian information criterion
* r_sq: R-squared
As described in the feature selection, the RSS and the $R^2$ improve as more features are selected. However, the other metrics (AIC, BIC and AICc) behave as the feature selection predicts: beyond approximately 6 features they generally get worse. This means that the additional features do not justify the improvement of the RSS and the $R^2$; the penalty for the more complex model outweighs its benefit. Therefore the smaller model is preferred. \newline
```{r}
lassos <- seq(1:30)
l <- length(lassos)
var <- rep(NA, l)
df <- rep(NA, l)
MSE.train <- rep(NA, l)
MSE.val <- rep(NA, l)
aic <- rep(NA, l)
aicc<-rep(NA, l)
bic<- rep(NA, l)
r_sq<- rep(NA, l)
for (i in 1:l){
fitted<-get(eval(paste0('fitted.',lassos[[i]],sep='')))
model.name<-get(eval(paste0('lasso.model.',lassos[[i]],sep='')))
var[i] <- dimnames(tail(coefficients(model.name,s=fitted$bestlam), n=1))[[1]]
MSE.train[i] <- round(fitted$mse.train,digits=4)
MSE.val[i] <- round(fitted$mse.val,digits=4)
bic[i] <- round(fitted$BIC$BIC,1)
aicc[i] <- round(fitted$BIC$AICc,1)
aic[i] <- round(fitted$BIC$AIC,1)
r_sq[i] <- round(fitted$RSq,digits=4)
}
model_perf <- data.frame(MSE.train, MSE.val, bic,aic,aicc,r_sq, var,row.names = var)
df.long <- pivot_longer(model_perf, cols=1:6, names_to = "metric", values_to = "value")
df.long$var <- factor(df.long$var, levels =var)
mse_plot <- ggplot(df.long[df.long$metric %in% c('MSE.train','MSE.val'),])+
aes(x= var, y=value, group=metric, color = metric)+
geom_line()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
axis.title.x = element_blank(), legend.position = c(0.8, 0.8))
bic_plot <- ggplot(df.long[df.long$metric %in% c("bic","aic","aicc"),])+
aes(x=var, y=value, group=metric, color=metric)+
geom_line()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
axis.title.x = element_blank(), legend.position = c(0.8, 0.8))
r_sq_plot<-ggplot(df.long[df.long$metric=="r_sq",])+
aes(x=var, y=value, group=1)+
geom_line()+
ylab("R-squared")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
axis.title.x = element_blank())
patch1 <- r_sq_plot + mse_plot
patch1 + plot_annotation(
title = 'Model performance'
)
bic_plot + plot_annotation(title='Model metrics')
```
The best model to be chosen is:
$$
log(g) = \beta_1*log(s) + \beta_2*log(e) + \beta_3*t + \beta_4 *log(u) + \beta_5 * w + \beta_0
$$
where:
* g: goutput
* s: size
* e: seed per hectare (seed_size)
* t: total labor per hectare (totlab_size)
* u: urea
* w: wage_cat
* $\beta_i$: coefficients and intercept
For the best model we can also look at how the coefficients evolve over the grid of lambdas.
```{r}
x<-prepare_x(goutput~log(size)+log(seed_size)+totlab_size+log(urea)+wage_cat)
lasso.model.13<-train_lasso_with_feature(x,y,prep$grid)
fitted.13<-train_lasso_cv(lasso.model.13,x,y,prep$grid)
beta=coef(lasso.model.13)
tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- lasso.model.13$lambda[tmp$variable+1] # extract the lambda values
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1] # compute L1 norm
# x11(width = 13/2.54, height = 9/2.54)
ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, linetype = coef)) +
geom_line() +
geom_vline(xintercept=fitted.13$bestlam)+
scale_x_log10() +
xlab("Lambda (log scale)") +
guides(color = guide_legend(title = ""),
linetype = guide_legend(title = "")) +
theme_bw() +
theme(legend.key.width = unit(3,"lines"))
```
\newpage
# 3 Second Model
The second model to present and evaluate is the Generalized Additive Model (GAM).
## 3.1 Generalized Additive Model (GAM)
Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables while maintaining additivity. Just like linear models, GAMs can be applied to both quantitative and qualitative responses.
GAMs allow us to fit a non-linear predictor $f_{j}$ to each variable $x_{ij}$, so that we will find non-linear relationships that standard multiple linear regression will miss. We do not need to manually try out many different transformations on each variable individually. The general GAM formula is as follows:
$$
Y_{i}=\beta_{0}+f_{1}\left(x_{i 1}\right)+f_{2}\left(x_{i 2}\right)+\ldots+f_{p}\left(x_{i p}\right)+\epsilon_{i}
$$
There is no constraint that each $f_{j}$ has to be the same type of function.
So $f_{1}$ could be a quadratic function, $f_{2}$ a smoothing spline function and $f_{3}$ a
loess function. And the smoothness of the function $f_{j}$ for the variable $X_{j}$ can be controlled independently for each variable.
* **Regression splines** are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots.
* **Smoothing splines** are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
* **Local regression** is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.
A GAM is restricted to be additive. Important interactions might be missed, but we can manually add an interaction term to the GAM model by adding a predictor for $X_{j}X_{k}$. Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM with a natural spline, since for smoothing splines least squares cannot be used. However, standard software such as the gam() function from the gam library in R can be used to fit GAMs with smoothing splines via an approach known as backfitting. There is thus an important difference between smoothing splines and natural splines: in the first case a penalized spline model is fitted, while in the second case only regression splines, i.e. splines without a penalty, are fitted.
The s() function, which is part of the gam library, is used to indicate that we would like to use a smoothing spline. Qualitative variables are automatically converted into dummy variables by the gam function according to the number of levels they have.
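A minimal sketch of such a call (the chosen predictors and degrees of freedom are purely illustrative):
```{r eval=FALSE}
library(gam)
# Illustrative GAM: smoothing splines with explicit degrees of freedom for two
# numeric predictors, plus a factor that gam() dummy-encodes automatically
gam_sketch <- gam(log(goutput) ~ s(log(size), df = 4) + s(totlab_size, df = 4) + region,
                  data = RiceFarms[train, ])
summary(gam_sketch)
```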
**Explanation of GAM visualization**
The visualization below is an example from our second GAM step, including the two variables size and totlab_size. For both of them a smoothing spline is used together with a log transformation. Because the model is additive, we can examine the effect of each $f_{j}$ on $Y$ individually while holding all of the other variables fixed. Both panels below have the same vertical scale, which allows us to visually assess the relative contributions of the variables. We observe that size and totlab_size have a large effect on goutput.
The left-hand panel indicates that, holding totlab_size fixed, goutput increases with size; the increase is very steep up to a size of about 0.5 hectares and then becomes flatter and flatter.
The right-hand panel indicates that, holding size fixed, goutput increases drastically with the proportion of labor per size up to about 500 hours/hectare and then flattens out. We can also see from the strip chart on the x-axis that most of the farms have sizes of up to 1 hectare and invest between 300 and 2000 hours of labor per hectare.
```{r}
#size
gam1 <- gam(log(goutput)~s(size), data = RiceFarms[train,])
gam2 <- gam(log(goutput)~s(log(size)), data = RiceFarms[train,])
```
```{r}
#par(mfrow=c(1,1))
#plot.Gam(gam2, residuals = TRUE, col="lightblue")
```
```{r fig.asp=0.5}
#labour
gam3 <- gam(log(goutput)~s(log(size))+s(log(totlab_size)), data = RiceFarms[train,])
gam4 <- gam(log(goutput)~s(log(size))+s(totlab_size), data = RiceFarms[train,])
par(mfrow=c(1,2))
plot.Gam(gam3, terms = "s(log(size))", residuals = TRUE, col="lightblue")
title("Example GAM Visualization", font.main= 1)
plot.Gam(gam3, terms = "s(log(totlab_size))", residuals = TRUE, col="lightblue")
```
```{r}
#urea
gam5 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)), data = RiceFarms[train,])
gam5.1 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(urea_size), data = RiceFarms[train,])
#anova(gam4, gam5, gam5.1) #use gam5 with log(urea) slightly better
#par(mfrow=c(1,3))
#plot.Gam(gam5, residuals = TRUE, col="lightblue")
```
```{r}
#phosphor
gam6 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size))+
s(log(phosph_size+1)), data = RiceFarms[train,])
gam6.1 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size))+
s(phosph_size), data = RiceFarms[train,])
#anova(gam5,gam6, gam7) #use gam6
```
```{r}
#seed
gam8 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(log(seed_size)), data = RiceFarms[train,])
gam9 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size), data = RiceFarms[train,])
#anova(gam6, gam8, gam9) #use gam9
```
```{r}
#pesticide
gam10 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size)+
s(pest_size), data = RiceFarms[train,])
#tail(summary(gam10)$parametric.anova$Pr, n=2)[1]
```
```{r}
gam10.1 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size)+
s(pest_size, df=1), data = RiceFarms[train,])
gam10.2 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size)+
pest_size, data = RiceFarms[train,])
#tail(summary(gam10.1)$parametric.anova$Pr, n=2)[1]
```
```{r}
#add price
gam11 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size)+
pest_size+
s(price), data = RiceFarms[train,])
#anova(gam10,gam11)
```
```{r}
#add fam_ratio
gam12 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size)+
pest_size+
s(price)+
s(fam_ratio), data = RiceFarms[train,])
gam12.1 <- gam(log(goutput)~
s(log(size))+
s(log(totlab_size))+
s(log(urea_size)) +
s(log(phosph_size+1))+
s(seed_size)+
pest_size+
s(price)+
s(fam_ratio, df=13), data = RiceFarms[train,])
#par(mfrow=c(2,4))
#plot.Gam(gam12.1, residuals = TRUE, col="lightblue")