---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
options(xts.warn_dplyr_breaks_lag = FALSE)
```
# InvestigatoR
<!-- badges: start -->
<!-- badges: end -->
The goal of InvestigatoR is to provide a flexible framework for machine-learning-based return prediction and portfolio backtesting: you specify features, prediction models, and portfolio-formation rules, and the package handles the backtesting loop.
## Installation
You can install the development version of InvestigatoR from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("ericschumann12/InvestigatoR")
```
## Basic Workflow
The following example walks through the complete workflow: loading data, generating return predictions with several machine learning models, and forming portfolios from those predictions.
```{r load_package}
library(InvestigatoR)
library(tidyverse)
```
First, we load the dataset that comes with the package:
```{r load_data}
data("data_ml")
data_ml |> distinct(date) |> pull(date) |> min()
data_ml |> distinct(date) |> pull(date) |> max()
```
The dataset was originally provided by Guillaume Coqueret and Tony Guida with their book [Machine Learning for Factor Investing](https://mlfactor.com), where you can also find a detailed description of its contents.
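To get a feel for the data, we can peek at the identifiers, the return label, and a few of the feature columns used below (a quick sketch; the column names are the ones selected in the next step):
```{r glimpse_data, eval=FALSE}
# peek at identifiers, the return label, and a few of the features used below
data_ml |>
  select(stock_id, date, R1M_Usd, Div_Yld, Eps, Mkt_Cap_12M_Usd) |>
  head()
```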
Next, we specify the set of features that should be used for return prediction, as well as some backtesting options: whether the prediction should be done with a rolling window (`rolling`), the window size (`"5 years"`), the step size (`"1 month"`, i.e., how often the ML model is re-estimated), the offset (`"1 month"`, to avoid any form of data leakage), and whether to keep in-sample predictions (`in_sample`).
```{r specify_features}
return_label <- "R1M_Usd"
features <- c("Div_Yld", "Eps", "Mkt_Cap_12M_Usd", "Mom_11M_Usd", "Ocf", "Pb", "Vol1Y_Usd")
rolling <- FALSE
window_size <- "5 years"
step_size <- "1 month"
offset <- "1 month"
in_sample <- TRUE
```
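As an optional sanity check, we can verify that the chosen label and features are actually present in the dataset:
```{r check_features, eval=FALSE}
# all selected columns should be present in data_ml
all(c(return_label, features) %in% colnames(data_ml))
```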
Next, we specify the machine learning configuration. Multiple configurations can be given at once: here, one for a linear regression model and two for a gradient boosting model. The linear regression configuration is empty, so the defaults are used; each gradient boosting configuration sets the number of boosting rounds, the maximum tree depth, the learning rate, and the objective function. Further prediction functions can be added by implementing them yourself, as shown in the final section below.
```{r specify_ml_config}
ml_config <- list(
  ols_pred = list(pred_func = "ols_pred", config = list()),
  xgb_pred = list(
    pred_func = "xgb_pred",
    config1 = list(nrounds = 10, max_depth = 3, eta = 0.3, objective = "reg:squarederror"),
    config2 = list(nrounds = 10, max_depth = 3, eta = 0.1, objective = "reg:squarederror")
  )
)
```
Finally, we call the backtesting function. It returns a return-prediction object whose `predictions` element contains, for each stock and date, the predicted returns of every model/configuration (labelled `ols_1`, `xgb_1`, and `xgb_2` here) together with the actual realized returns; the difference between the two is the prediction error that the summary statistics below are based on.
```{r backtesting_returns}
rp <- backtesting_returns(
  data = data_ml, return_prediction_object = NULL,
  return_label, features, rolling = FALSE, window_size, step_size,
  offset, in_sample, ml_config, append = FALSE, num_cores = NULL
)
```
Next, we take these predictions and analyse their statistical properties:
```{r analyse_predictions}
rp$predictions |> head()
rp_stats <- summary(rp)
print(rp_stats)
```
Next, we map these predictions into quantile-based portfolios and analyse their performance. We specify the weight restrictions: the long and short cutoff quantiles, whether short sales are allowed, the minimum and maximum weight per asset, and the `b` parameter that scales the amount invested per leg (`b = 1` means 100% long and 100% short). We also select which predictions enter the portfolio formation (here `ols_1`, `xgb_1`, and `xgb_2`).
```{r specify_pf_config}
pf_config <- list(
  predictions = c("ols_1", "xgb_1", "xgb_2"),
  quantile_weight = list(
    pred_func = "quantile_weights",
    config1 = list(quantiles = list(long = 0.20, short = 0.20),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1),
    config2 = list(quantiles = list(long = 0.10, short = 0.10),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1)
  )
)
```
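To make the weighting scheme concrete, here is a minimal sketch of the idea behind quantile weights (illustrative only, not the package's actual implementation): go long, equally weighted, in the stocks whose predicted return falls in the top quantile, and analogously short the bottom quantile if short sales are allowed.
```{r quantile_weight_sketch, eval=FALSE}
# illustrative only: equal weights in a top-20% long leg, no short leg
toy <- tibble(stock_id = 1:10, pred_return = rnorm(10))
cutoff <- quantile(toy$pred_return, probs = 1 - 0.20)
toy |>
  mutate(weight = if_else(pred_return >= cutoff, 1, 0)) |>
  mutate(weight = weight / sum(weight))
```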
Finally, we run the portfolio formation process:
```{r backtesting_portfolios}
pf <- backtesting_portfolios(return_prediction_object = rp, pf_config = pf_config)
```
Let us check the content of `pf` and calculate some summary statistics:
```{r check_pf}
pf$weights |> head()
pf$portfolio_returns |> head()
pf_stats <- summary(pf)
print(pf_stats)
```
Alternatively, we can also calculate statistics from the `PerformanceAnalytics` package.
```{r check_pf_perf}
library(tidyquant)
# tidyquant::tq_performance_fun_options()
summary(pf)
summary(pf, type = "table.AnnualizedReturns")
summary(pf, type = "table.Distributions")
summary(pf, type = "table.DownsideRisk")
summary(pf, type = "table.DrawdownsRatio")
summary(pf, type = "cov")
```
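The `type` argument accepts the names of `PerformanceAnalytics` functions such as the `table.*` family. To see which ones are available, you can list them, mirroring how the chart functions are listed in the plotting section below:
```{r list_table_functions, eval=FALSE}
# list the table.* functions exported by PerformanceAnalytics
ls("package:PerformanceAnalytics")[grepl("table", ls("package:PerformanceAnalytics"))]
```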
Now we plot the corresponding cumulative returns of the portfolios:
```{r plot_pf}
plot(pf)
```
Alternatively, the plotting function is designed to take charting functions from the `PerformanceAnalytics` package (attached via `tidyquant`) as inputs:
```{r plot_pf_tq, warning=FALSE}
library(tidyquant)
ls("package:PerformanceAnalytics")[grepl("chart",ls("package:PerformanceAnalytics"))]
plot(pf, type = "chart.CumReturns")
plot(pf, type = "charts.PerformanceSummary")
plot(pf, type = "chart.Boxplot")
```
## Implement Your Own Functions
Let's start with a simple random forest implementation. We need to specify the name of the prediction function and its configuration. The logic is straightforward: create a function `rf_pred` with arguments `train_data`, `test_data`, and `config`; the configuration list is passed on to the underlying model.
```{r specify_rf_config}
rf_config <- list(
  rf_pred = list(pred_func = "rf_pred",
                 config1 = list(num.trees = 100, max.depth = 3, mtry = 3))
)
```
Next, we implement the prediction function. It takes the training data, the test data, and the configuration as arguments and returns the predicted returns. The function uses the `ranger` package to fit a random forest to the training data and to predict returns for the test data.
```{r implement_rf}
rf_pred <- function(train_data, test_data, config) {
  # the label is in column 3, the features in columns 4 onwards
  train_features <- train_data[, 4:ncol(train_data)]
  train_label <- as.matrix(train_data[, 3])
  # add the data to the configuration
  config$x <- train_features
  config$y <- train_label
  # train the model
  fit <- do.call(ranger::ranger, config)
  # predict on the test data
  predictions <- as.vector(predict(fit, test_data)$predictions)
  # match predictions back to stock_id and date
  predictions <- tibble::tibble(
    stock_id = test_data$stock_id,
    date = test_data$date,
    pred_return = predictions
  )
  return(predictions)
}
```
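Before plugging `rf_pred` into the backtester, it can be sanity-checked on a manual split (a hedged sketch: the split date is arbitrary, and the column order assumes `stock_id`, `date`, the label, and then the features, as in the function above):
```{r test_rf_pred, eval=FALSE}
# illustrative check on a manual train/test split (split date chosen arbitrarily)
split_date <- as.Date("2010-01-01")
cols <- c("stock_id", "date", return_label, features)
train_data <- data_ml |> filter(date < split_date) |> select(all_of(cols))
test_data <- data_ml |> filter(date >= split_date) |> select(all_of(cols))
rf_pred(train_data, test_data,
        config = list(num.trees = 100, max.depth = 3, mtry = 3)) |> head()
```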
Finally, we call the backtesting function with the new configuration. The resulting object has the same structure as before; by passing the `rp` object from the previous step as `return_prediction_object`, the random forest predictions are added alongside the existing ones.
```{r backtesting_returns_rf}
rp_rf <- backtesting_returns(
  data = data_ml, return_prediction_object = rp,
  return_label, features, rolling = FALSE, window_size, step_size,
  offset, in_sample, rf_config, append = FALSE, num_cores = NULL
)
```
Next, we check the prediction summary:
```{r analyse_predictions_rf}
rp_rf$predictions |> head()
rp_rf_stats <- summary(rp_rf)
print(rp_rf_stats)
```
Let's create portfolios again, now including the random forest predictions:
```{r backtesting_portfolios_rf}
pf_config <- list(
  predictions = c("xgb_2", "rf_1"),
  quantile_weight = list(
    pred_func = "quantile_weights",
    config1 = list(quantiles = list(long = 0.20, short = 0.20),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1),
    config2 = list(quantiles = list(long = 0.10, short = 0.10),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1)
  )
)
pf_rf <- backtesting_portfolios(return_prediction_object = rp_rf, pf_config = pf_config)
pf_rf <- backtesting_portfolios(return_prediction_object = rp_rf, pf_config = pf_config)
```
And finally, we plot the portfolios and summarise their statistics:
```{r check_pf_rf}
plot(pf_rf)
pf_rf_stats <- summary(pf_rf)
print(pf_rf_stats)
```