Title: | Adaptive and Automatic Gradient Boosting Computations |
---|---|
Description: | Fast and automatic gradient tree boosting designed to avoid manual tuning and cross-validation by utilizing an information theoretic approach. This makes the algorithm adaptive to the dataset at hand; it is completely automatic, and with minimal worries of overfitting. Consequently, the speed-ups relative to state-of-the-art implementations can be in the thousands, while the mathematical and technical knowledge required of the user is minimized. |
Authors: | Berent Ånund Strømnes Lunde |
Maintainer: | Berent Ånund Strømnes Lunde <[email protected]> |
License: | GPL-3 |
Version: | 0.9.3 |
Built: | 2024-11-07 04:42:10 UTC |
Source: | https://github.com/cran/agtboost |
Adaptive and Automatic Gradient Boosting Computations
agtboost is a lightning fast gradient boosting library designed to avoid manual tuning and cross-validation by utilizing an information theoretic approach. This makes the algorithm adaptive to the dataset at hand; it is completely automatic, and with minimal worries of overfitting. Consequently, the speed-ups relative to state-of-the-art implementations can be in the thousands, while the mathematical and technical knowledge required of the user is minimized.
Important functions:
gbt.train: function for training an agtboost ensemble
predict.Rcpp_ENSEMBLE: function for predicting from an agtboost ensemble
See individual function documentation for usage.
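For orientation, a minimal end-to-end sketch on simulated Gaussian data (it mirrors the examples in the individual function documentation below):
library(agtboost)
set.seed(1)
# Simulated training and test data
x_tr <- as.matrix(runif(500, 0, 4))
y_tr <- rnorm(500, x_tr, 1)
x_te <- as.matrix(runif(500, 0, 4))
# Train: tree complexity and the number of trees are chosen automatically
mod <- gbt.train(y_tr, x_tr)
# Predict on new data
y_pred <- predict(mod, x_te)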
Berent Ånund Strømnes Lunde
caravan.train and caravan.test both contain a design matrix with 85 columns and a response vector. The train set consists of 70% of the data, with 4075 rows. The test set consists of the remaining 30%, with 1747 rows. The following references the documentation within the ISLR package:
The original data contains 5822 real customer records. Each record consists of 86 variables, containing sociodemographic data (variables 1-43) and product ownership (variables 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. Variable 86 (Purchase) indicates whether the customer purchased a caravan insurance policy. Further information on the individual variables can be obtained at http://www.liacs.nl/~putten/library/cc2000/data.html
caravan.train; caravan.test
Lists with a design matrix x and response y.
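A quick check of the stated dimensions (a minimal sketch using only base R):
data(caravan.train, caravan.test, package = "agtboost")
dim(caravan.train$x)     # 4075 x 85
length(caravan.train$y)  # 4075
dim(caravan.test$x)      # 1747 x 85
table(caravan.train$y)   # binary purchase indicator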
The data was originally supplied by Sentient Machine Research and was used in the CoIL Challenge 2000.
P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000. See http://www.liacs.nl/~putten/library/cc2000/
P. van der Putten and M. van Someren. A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000. Machine Learning, October 2004, vol. 57, iss. 1-2, pp. 177-195, Kluwer Academic Publishers
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with applications in R. Springer-Verlag, New York. https://trevorhastie.github.io/ISLR/
summary(caravan.train)
summary(caravan.test)
gbt.complexity creates a list of hyperparameters from a model
gbt.complexity(model, type)
model | object or pointer to object of class ENSEMBLE |
type | currently supports "xgboost" or "lightgbm" |
Returns the complexity of model in terms of the hyperparameters associated with model type.
list with type hyperparameters.
set.seed(123)
library(agtboost)
n <- 10000
xtr <- as.matrix(runif(n, 0, 4))
ytr <- rnorm(n, xtr, 1)
xte <- as.matrix(runif(n, 0, 4))
yte <- rnorm(n, xte, 1)
model <- gbt.train(ytr, xtr, learning_rate = 0.1)
gbt.complexity(model, type = "xgboost")
gbt.complexity(model, type = "lightgbm")
## See demo(topic = "gbt-complexity", package = "agtboost")
gbt.convergence calculates the loss on supplied data over the boosting iterations of the model
gbt.convergence(object, y, x)
object | Object or pointer to object of class ENSEMBLE |
y | response vector |
x | design matrix for training. Must be of type matrix |
Computes the loss on supplied data at each boosting iteration of the model passed as object. This may be used to visually test for overfitting on test data, or the converse, to check for underfitting or non-convergence.
vector with $K+1$ elements, containing the loss at the initial constant prediction and after each of the $K$ boosting iterations
## Gaussian regression:
x_tr <- as.matrix(runif(500, 0, 4))
y_tr <- rnorm(500, x_tr, 1)
x_te <- as.matrix(runif(500, 0, 4))
y_te <- rnorm(500, x_te, 1)
mod <- gbt.train(y_tr, x_tr)
convergence <- gbt.convergence(mod, y_te, x_te)
which.min(convergence)   # Should be fairly similar to boosting iterations + 1
mod$get_num_trees() + 1  # num_trees does not include the initial prediction
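A convergence curve is often easiest to judge visually; a minimal sketch continuing the example above:
## Plot test loss against boosting iteration (iteration 0 = initial constant prediction)
plot(seq_along(convergence) - 1, convergence, type = "l",
     xlab = "Boosting iteration", ylab = "Test loss")
abline(v = mod$get_num_trees(), lty = 2)  # number of trees in the fitted ensemble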
gbt.importance creates a data.frame of feature importance in a model
gbt.importance(feature_names, object)
feature_names | character vector of feature names |
object | object or pointer to object of class ENSEMBLE |
Sums up the "expected reduction" in generalization loss (scaled by learning_rate) at each node of each tree in the model, and attributes it to the feature the node is split on. Results are returned as percentages.
data.frame with the percentage of total loss reduction attributed to each feature.
## Load data
data(caravan.train, package = "agtboost")
train <- caravan.train
mod <- gbt.train(train$y, train$x, loss_function = "logloss", verbose = 10)
feature_names <- colnames(train$x)
imp <- gbt.importance(feature_names, mod)
imp
gbt.ksval transforms observations to U(0,1) if the model is correct and performs a Kolmogorov-Smirnov test for uniformity.
gbt.ksval(object, y, x)
object | Object or pointer to object of class ENSEMBLE |
y | Observations to be tested |
x | design matrix for training. Must be of type matrix |
Model validation of the model passed as object using observations y.
Assuming the loss is a negative log-likelihood and thus a probabilistic model, the transformation $u = F_Y(y;\hat{\theta}) \sim U(0,1)$, if the model is correct, is usually valid. One parameter, $\mu = g^{-1}(f(x))$, is given by the model. Remaining parameters are estimated globally over feature space, assuming they are constant. This then allows the above transformation to be exploited, so that the Kolmogorov-Smirnov test for uniformity can be performed.
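For a Gaussian fit this transformation can be written out directly. A minimal sketch; the globally estimated standard deviation below is an illustrative choice and is not claimed to match gbt.ksval's internal estimate:
library(agtboost)
set.seed(1)
x_tr <- as.matrix(runif(500, 0, 4)); y_tr <- rnorm(500, x_tr, 1)
x_te <- as.matrix(runif(500, 0, 4)); y_te <- rnorm(500, x_te, 1)
mod <- gbt.train(y_tr, x_tr)
mu_hat <- predict(mod, x_te)                # parameter supplied by the model
sigma_hat <- sd(y_tr - predict(mod, x_tr))  # remaining parameter, estimated globally (illustrative)
u <- pnorm(y_te, mean = mu_hat, sd = sigma_hat)  # u = F_Y(y; theta_hat)
ks.test(u, "punif")                         # Kolmogorov-Smirnov test for uniformity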
If the response is a count model (poisson or negbinom), the transformation $u = F_Y(y-1;\hat{\theta}) + U \cdot f_Y(y;\hat{\theta})$, with $U \sim \mathrm{unif}(0,1)$, is used to obtain a continuous transformation to the unit interval, which, if the model is correct, will give standard uniform random variables.
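This is the standard randomized probability integral transform for discrete data; a self-contained sketch with a known Poisson rate (the data and rate are made up purely for illustration):
## u = F_Y(y - 1) + U * P(Y = y), U ~ unif(0,1)
set.seed(1)
lambda <- 3
y_cnt <- rpois(1000, lambda)
u <- ppois(y_cnt - 1, lambda) + runif(length(y_cnt)) * dpois(y_cnt, lambda)
ks.test(u, "punif")  # approximately uniform when the assumed model is correct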
Kolmogorov-Smirnov test of model
## Gaussian regression:
x_tr <- as.matrix(runif(500, 0, 4))
y_tr <- rnorm(500, x_tr, 1)
x_te <- as.matrix(runif(500, 0, 4))
y_te <- rnorm(500, x_te, 1)
mod <- gbt.train(y_tr, x_tr)
gbt.ksval(mod, y_te, x_te)
gbt.load is an interface for loading an agtboost model.
gbt.load(file)
file | Valid file-path to a stored aGTBoost model |
The load function for agtboost. Loads a GTB model from a txt file.
Trained aGTBoost model.
gbt.save is an interface for storing an agtboost model.
gbt.save(gbt_model, file)
gbt_model | Model object or pointer to object of class ENSEMBLE |
file | Valid file-path |
The model-storage function for agtboost. Saves a GTB model as a txt file. May be retrieved using gbt.load.
Txt file that can be loaded using gbt.load.
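Neither function carries an example here; a minimal round-trip sketch (the file path via tempdir() is illustrative):
library(agtboost)
## Fit a small model on simulated data
x <- as.matrix(runif(500, 0, 4))
y <- rnorm(500, x, 1)
mod <- gbt.train(y, x)
## Save to a txt file and load it back
path <- file.path(tempdir(), "agtboost_model.txt")
gbt.save(mod, path)
mod2 <- gbt.load(path)
## Predictions from the reloaded model should match the original
all.equal(predict(mod, x), predict(mod2, x))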
gbt.train is an interface for training an agtboost model.
gbt.train(
  y,
  x,
  learning_rate = 0.01,
  loss_function = "mse",
  nrounds = 50000,
  verbose = 0,
  gsub_compare,
  algorithm = "global_subset",
  previous_pred = NULL,
  weights = NULL,
  force_continued_learning = FALSE,
  offset = NULL,
  ...
)
y | response vector for training. Must correspond to the design matrix x |
x | design matrix for training. Must be of type matrix |
learning_rate | control the learning rate: scale the contribution of each tree by a factor of learning_rate. Default: 0.01 |
loss_function | specify the learning objective (loss function). Only pre-specified loss functions are currently supported. Default: "mse" |
nrounds | a just-in-case max number of boosting iterations. Default: 50000 |
verbose | Enable boosting tracing information at the i-th iteration? Default: 0 |
gsub_compare | Deprecated. Boolean: Global-subset comparisons. |
algorithm | specify the algorithm used for gradient tree boosting. Default: "global_subset" |
previous_pred | prediction vector for training. Boosted training given predictions from another model. |
weights | weights vector for scaling contributions of individual observations. Default: NULL |
force_continued_learning | Boolean: if TRUE, continue boosting past the automatic (information criterion) stopping point, up to nrounds. Default: FALSE |
offset | add offset to the model g(mu) = offset + F(x). |
... | additional parameters passed. |
This is the training function for an agtboost model.
gbt.train learns trees with adaptive complexity given by an information criterion, until the same (but scaled) information criterion tells the algorithm to stop. The data used for training at each boosting iteration stems from a second-order Taylor expansion of the loss function, evaluated at the predictions given by the ensemble at the previous boosting iteration.
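As an illustration of that expansion (not agtboost's internal code): for the default mse loss, each new tree is fitted to the per-observation first and second derivatives of the loss, taken at the previous ensemble prediction. A minimal sketch:
## Illustration only: derivatives feeding the next boosting iteration for mse loss
## l(y, f) = (y - f)^2, expanded about the previous prediction f_k(x)
set.seed(1)
x_tr <- as.matrix(runif(500, 0, 4))
y_tr <- rnorm(500, x_tr, 1)
grad <- function(y, f) -2 * (y - f)       # dl/df evaluated at f = f_k(x)
hess <- function(y, f) rep(2, length(y))  # d^2l/df^2 evaluated at f = f_k(x)
f_k <- rep(mean(y_tr), length(y_tr))      # e.g. the initial constant prediction
g <- grad(y_tr, f_k)
h <- hess(y_tr, f_k)
## The next tree t() is fit to (approximately) minimize sum_i g_i * t(x_i) + 0.5 * h_i * t(x_i)^2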
An object of class ENSEMBLE with some or all of the following elements (a short access sketch follows the list):
handle: a handle (pointer) to the agtboost model in memory.
initialPred: a field containing the initial prediction of the ensemble.
set_param: function for changing the parameters of the ensemble.
train: function for re-training (or training from scratch) the ensemble directly on a response vector y and design matrix x.
predict: function for predicting observations given a design matrix.
predict2: function as above, but takes as a parameter the maximum number of boosting iterations to use.
estimate_generalization_loss: function for calculating the (approximate) optimism of the ensemble.
get_num_trees: function returning the number of trees in the ensemble.
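A brief sketch of inspecting a fitted ensemble object, using only the fields and functions listed above together with the S3 predict method documented below:
library(agtboost)
set.seed(1)
x <- as.matrix(runif(500, 0, 4))
y <- rnorm(500, x, 1)
mod <- gbt.train(y, x)
mod$get_num_trees()   # number of trees chosen by the information criterion
mod$initialPred       # the initial (constant) prediction of the ensemble
predict(mod, x)[1:5]  # predictions via the S3 method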
Berent Ånund Strømnes Lunde, Tore Selland Kleppe and Hans Julius Skaug, "An Information Criterion for Automatic Gradient Tree Boosting", 2020, https://arxiv.org/abs/2008.05926
## A simple gbt.train example with linear regression:
x <- runif(500, 0, 4)
y <- rnorm(500, x, 1)
x.test <- runif(500, 0, 4)
y.test <- rnorm(500, x.test, 1)
mod <- gbt.train(y, as.matrix(x))
y.pred <- predict(mod, as.matrix(x.test))
plot(x.test, y.test)
points(x.test, y.pred, col = "red")
predict is an interface for predicting from an agtboost model.
## S3 method for class 'Rcpp_ENSEMBLE' predict(object, newdata, ...)
object | Object or pointer to object of class Rcpp_ENSEMBLE |
newdata | Design matrix of data to be predicted. Type matrix |
... | additional parameters passed. Currently not in use. |
The prediction function for agtboost. Using the generic predict function in R is also possible, using the same arguments.
For regression or binary classification, it returns a vector of length nrow(newdata).
Berent Ånund Strømnes Lunde, Tore Selland Kleppe and Hans Julius Skaug, "An Information Criterion for Automatic Gradient Tree Boosting", 2020, https://arxiv.org/abs/2008.05926
## A simple gbt.train example with linear regression:
x <- runif(500, 0, 4)
y <- rnorm(500, x, 1)
x.test <- runif(500, 0, 4)
y.test <- rnorm(500, x.test, 1)
mod <- gbt.train(y, as.matrix(x))
## predict is overloaded
y.pred <- predict(mod, as.matrix(x.test))
plot(x.test, y.test)
points(x.test, y.pred, col = "red")
predict is an interface for predicting from an agtboost model.
## S3 method for class 'Rcpp_GBT_COUNT_AUTO' predict(object, newdata, ...)
object | Object or pointer to object of class Rcpp_GBT_COUNT_AUTO |
newdata | Design matrix of data to be predicted. Type matrix |
... | additional parameters passed. Currently not in use. |
The prediction function for agtboost. Using the generic predict function in R is also possible, using the same arguments.
For regression or binary classification, it returns a vector of length nrow(newdata).
Berent Ånund Strømnes Lunde, Tore Selland Kleppe and Hans Julius Skaug, "An Information Criterion for Automatic Gradient Tree Boosting", 2020, https://arxiv.org/abs/2008.05926
## A simple gbt.train example with linear regression:
## Random generation of zero-inflated poisson
2+2
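The example above is only a placeholder. Below is a hypothetical sketch of training and predicting with the automatic count model; the value loss_function = "count::auto" is an assumption about which option yields an object of class Rcpp_GBT_COUNT_AUTO, as this documentation does not state it explicitly:
## Hypothetical sketch: the loss_function value is an assumption, not confirmed above
library(agtboost)
set.seed(1)
x <- as.matrix(runif(500, 0, 4))
y <- rpois(500, lambda = exp(-1 + 0.5 * x))   # simulated count response
mod_count <- gbt.train(y, x, loss_function = "count::auto")
pred <- predict(mod_count, x)                 # dispatches to the Rcpp_GBT_COUNT_AUTO method
head(pred)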