---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
options(xts.warn_dplyr_breaks_lag = FALSE)
```
# InvestigatoR
<!-- badges: start -->
<!-- badges: end -->
The goal of InvestigatoR is to provide a flexible framework for machine-learning-based return prediction and portfolio backtesting: you specify features, prediction models, and portfolio-formation rules, and the package handles the backtesting loop.
## Installation
You can install the development version of InvestigatoR from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("ericschumann12/InvestigatoR")
```
## Basic Workflow
The following example walks through the complete workflow: loading data, generating return predictions with several machine learning models, and forming portfolios from those predictions.
```{r load_package}
library(InvestigatoR)
library(tidyverse)
```
First, we load the dataset that comes with the package:
```{r load_data}
data("data_ml")
data_ml |> distinct(date) |> pull(date) |> min()
data_ml |> distinct(date) |> pull(date) |> max()
```
The dataset was originally provided by Guillaume Coqueret and Tony Guida with their book [Machine Learning for Factor Investing](https://mlfactor.com), where you can also find a detailed description of its contents.
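To get a feel for the data, we can peek at the identifiers, the return label, and a few of the feature columns used below (a quick sketch; the column names are the ones selected in the next step):
```{r glimpse_data, eval=FALSE}
# peek at identifiers, the return label, and a few of the features used below
data_ml |>
  select(stock_id, date, R1M_Usd, Div_Yld, Eps, Mkt_Cap_12M_Usd) |>
  head()
```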
Next, we specify the set of features that should be used for return prediction, as well as some backtesting options: whether the prediction should be done with a rolling window (`rolling`), the window size (`"5 years"`), the step size (`"1 month"`, i.e., how often the ML model is re-estimated), the offset (`"1 month"`, to avoid any form of data leakage), and whether to keep in-sample predictions (`in_sample`).
```{r specify_features}
return_label <- "R1M_Usd"
features <- c("Div_Yld", "Eps", "Mkt_Cap_12M_Usd", "Mom_11M_Usd", "Ocf", "Pb", "Vol1Y_Usd")
rolling <- FALSE
window_size <- "5 years"
step_size <- "1 month"
offset <- "1 month"
in_sample <- TRUE
```
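As an optional sanity check, we can verify that the chosen label and features are actually present in the dataset:
```{r check_features, eval=FALSE}
# all selected columns should be present in data_ml
all(c(return_label, features) %in% colnames(data_ml))
```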
Next, we specify the machine learning configuration. Multiple configurations can be given at once: here, one for a linear regression model and two for a gradient boosting model. The linear regression configuration is empty, so the defaults are used; each gradient boosting configuration sets the number of boosting rounds, the maximum tree depth, the learning rate, and the objective function. Further prediction functions can be added by implementing them yourself, as shown in the final section below.
```{r specify_ml_config}
ml_config <- list(
  ols_pred = list(pred_func = "ols_pred", config = list()),
  xgb_pred = list(
    pred_func = "xgb_pred",
    config1 = list(nrounds = 10, max_depth = 3, eta = 0.3, objective = "reg:squarederror"),
    config2 = list(nrounds = 10, max_depth = 3, eta = 0.1, objective = "reg:squarederror")
  )
)
```
Finally, we call the backtesting function. It returns a return-prediction object whose `predictions` element contains, for each stock and date, the predicted returns of every model/configuration (labelled `ols_1`, `xgb_1`, and `xgb_2` here) together with the actual realized returns; the difference between the two is the prediction error that the summary statistics below are based on.
```{r backtesting_returns}
rp <- backtesting_returns(
  data = data_ml, return_prediction_object = NULL,
  return_label, features, rolling = FALSE, window_size, step_size,
  offset, in_sample, ml_config, append = FALSE, num_cores = NULL
)
```
Next, we take these predictions and analyse their statistical properties:
```{r analyse_predictions}
rp$predictions |> head()
rp_stats <- summary(rp)
print(rp_stats)
```
Next, we map these predictions into quantile-based portfolios and analyse their performance. We specify the weight restrictions: the long and short cutoff quantiles, whether short sales are allowed, the minimum and maximum weight per asset, and the `b` parameter that scales the amount invested per leg (`b = 1` means 100% long and 100% short). We also select which predictions enter the portfolio formation (here `ols_1`, `xgb_1`, and `xgb_2`).
```{r specify_pf_config}
pf_config <- list(
  predictions = c("ols_1", "xgb_1", "xgb_2"),
  quantile_weight = list(
    pred_func = "quantile_weights",
    config1 = list(quantiles = list(long = 0.20, short = 0.20),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1),
    config2 = list(quantiles = list(long = 0.10, short = 0.10),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1)
  )
)
```
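To make the weighting scheme concrete, here is a minimal sketch of the idea behind quantile weights (illustrative only, not the package's actual implementation): go long, equally weighted, in the stocks whose predicted return falls in the top quantile, and analogously short the bottom quantile if short sales are allowed.
```{r quantile_weight_sketch, eval=FALSE}
# illustrative only: equal weights in a top-20% long leg, no short leg
toy <- tibble(stock_id = 1:10, pred_return = rnorm(10))
cutoff <- quantile(toy$pred_return, probs = 1 - 0.20)
toy |>
  mutate(weight = if_else(pred_return >= cutoff, 1, 0)) |>
  mutate(weight = weight / sum(weight))
```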
Finally, we run the portfolio formation process:
```{r backtesting_portfolios}
pf <- backtesting_portfolios(return_prediction_object = rp, pf_config = pf_config)
```
Let us check the content of `pf` and calculate some summary statistics:
```{r check_pf}
pf$weights |> head()
pf$portfolio_returns |> head()
pf_stats <- summary(pf)
print(pf_stats)
```
Alternatively, we can also calculate statistics from the `PerformanceAnalytics` package.
```{r check_pf_perf}
library(tidyquant)
# tidyquant::tq_performance_fun_options()
summary(pf)
summary(pf, type = "table.AnnualizedReturns")
summary(pf, type = "table.Distributions")
summary(pf, type = "table.DownsideRisk")
summary(pf, type = "table.DrawdownsRatio")
summary(pf, type = "cov")
```
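The `type` argument accepts the names of `PerformanceAnalytics` functions such as the `table.*` family. To see which ones are available, you can list them, mirroring how the chart functions are listed in the plotting section below:
```{r list_table_functions, eval=FALSE}
# list the table.* functions exported by PerformanceAnalytics
ls("package:PerformanceAnalytics")[grepl("table", ls("package:PerformanceAnalytics"))]
```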
Now we plot the corresponding cumulative returns of the portfolios:
```{r plot_pf}
plot(pf)
```
Alternatively, the plotting function is designed to take charting functions from the `PerformanceAnalytics` package (attached via `tidyquant`) as inputs:
```{r plot_pf_tq, warning=FALSE}
library(tidyquant)
ls("package:PerformanceAnalytics")[grepl("chart",ls("package:PerformanceAnalytics"))]
plot(pf, type = "chart.CumReturns")
plot(pf, type = "charts.PerformanceSummary")
plot(pf, type = "chart.Boxplot")
```
## Implement Your Own Functions
Let's start with a simple random forest implementation. We need to specify the name of the prediction function and its configuration. The logic is straightforward: create a function `rf_pred` with arguments `train_data`, `test_data`, and `config`; the configuration list is passed on to the underlying model.
```{r specify_rf_config}
rf_config <- list(
  rf_pred = list(pred_func = "rf_pred",
                 config1 = list(num.trees = 100, max.depth = 3, mtry = 3))
)
```
Next, we implement the prediction function. It takes the training data, the test data, and the configuration as arguments and returns the predicted returns. The function uses the `ranger` package to fit a random forest to the training data and to predict returns for the test data.
```{r implement_rf}
rf_pred <- function(train_data, test_data, config) {
  # the label is in column 3, the features in columns 4 onwards
  train_features <- train_data[, 4:ncol(train_data)]
  train_label <- as.matrix(train_data[, 3])
  # add the data to the configuration
  config$x <- train_features
  config$y <- train_label
  # train the model
  fit <- do.call(ranger::ranger, config)
  # predict on the test data
  predictions <- as.vector(predict(fit, test_data)$predictions)
  # match predictions back to stock_id and date
  predictions <- tibble::tibble(
    stock_id = test_data$stock_id,
    date = test_data$date,
    pred_return = predictions
  )
  return(predictions)
}
```
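Before plugging `rf_pred` into the backtester, it can be sanity-checked on a manual split (a hedged sketch: the split date is arbitrary, and the column order assumes `stock_id`, `date`, the label, and then the features, as in the function above):
```{r test_rf_pred, eval=FALSE}
# illustrative check on a manual train/test split (split date chosen arbitrarily)
split_date <- as.Date("2010-01-01")
cols <- c("stock_id", "date", return_label, features)
train_data <- data_ml |> filter(date < split_date) |> select(all_of(cols))
test_data <- data_ml |> filter(date >= split_date) |> select(all_of(cols))
rf_pred(train_data, test_data,
        config = list(num.trees = 100, max.depth = 3, mtry = 3)) |> head()
```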
Finally, we call the backtesting function with the new configuration. The resulting object has the same structure as before; by passing the `rp` object from the previous step as `return_prediction_object`, the random forest predictions are added alongside the existing ones.
```{r backtesting_returns_rf}
rp_rf <- backtesting_returns(
  data = data_ml, return_prediction_object = rp,
  return_label, features, rolling = FALSE, window_size, step_size,
  offset, in_sample, rf_config, append = FALSE, num_cores = NULL
)
```
Next, we check the prediction summary:
```{r analyse_predictions_rf}
rp_rf$predictions |> head()
rp_rf_stats <- summary(rp_rf)
print(rp_rf_stats)
```
Let's create portfolios again, now including the random forest predictions:
```{r backtesting_portfolios_rf}
pf_config <- list(
  predictions = c("xgb_2", "rf_1"),
  quantile_weight = list(
    pred_func = "quantile_weights",
    config1 = list(quantiles = list(long = 0.20, short = 0.20),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1),
    config2 = list(quantiles = list(long = 0.10, short = 0.10),
                   allow_short_sale = FALSE, min_weight = -1, max_weight = 1, b = 1)
  )
)
pf_rf <- backtesting_portfolios(return_prediction_object = rp_rf, pf_config = pf_config)
pf_rf <- backtesting_portfolios(return_prediction_object = rp_rf, pf_config = pf_config)
```
And finally, we plot the portfolios and summarise their statistics:
```{r check_pf_rf}
plot(pf_rf)
pf_rf_stats <- summary(pf_rf)
print(pf_rf_stats)
```