
Transform Regression and the Kolmogorov Superposition Theorem

Edwin Pednault

IBM T. J. Watson Research Center 1101 Kitchawan Road, P.O. Box 218 Yorktown Heights, NY 10598 USA [email protected]

Abstract

This paper presents a new predictive modeling algorithm that draws inspiration from the Kolmogorov superposition theorem. An initial version of the algorithm is presented that combines gradient boosting, generalized additive models, and decision-tree methods to construct models that have the same overall mathematical structure as Kolmogorov's superposition equation. Improvements to the algorithm are then presented that significantly increase its rate of convergence. The resulting algorithm, dubbed "transform regression," generates surprisingly good models compared to those produced by the underlying decision-tree method when the latter is applied directly. Transform regression is highly scalable, and a parallelized database-embedded version of the algorithm has been implemented as part of IBM DB2 Intelligent Miner Modeling.

Keywords: Gradient boosting, Generalized additive modeling, Decision trees.

1. Introduction

In many respects, decision trees and neural networks represent diametrically opposed classes of learning techniques. A strength of one is often a weakness of the other. Decision-tree methods approximate response surfaces by segmenting the input space into regions and using simple models within each region for local surface approximation. The strengths of decision-tree methods are that they are nonparametric, fully automated, and computationally efficient. Their weakness is that statistical estimation errors increase with the depth of trees, which ultimately limits the granularity of the surface approximation that can be achieved for fixed-size data. In contrast, neural network methods fit highly flexible families of nonlinear parametric functions to entire surfaces to construct global approximations. The strength of this approach is that it avoids the increase in estimation error that accompanies segmentation and local model fitting.
The weakness is that fitting nonlinear parametric functions to data is computationally demanding, and these demands are exacerbated by the fact that several network architectures often need to be trained and evaluated in order to maximize predictive accuracy.

This paper presents a new modeling approach that attempts to combine the strengths of the methods described above--specifically, the global fitting aspect of neural networks with the automated, computationally efficient, and nonparametric aspects of decision trees. To achieve this union, this new modeling method draws inspiration from the Kolmogorov superposition theorem:

Theorem (Kolmogorov, 1957). For every integer dimension d ≥ 2, there exist continuous real functions h_ij(x) defined on the unit interval U = [0,1], such that for every continuous real function f(x_1,...,x_d) defined on the d-dimensional unit hypercube U^d, there exist real continuous functions g_i(x) such that

f(x_1, \ldots, x_d) = \sum_{i=1}^{2d+1} g_i\left( \sum_{j=1}^{d} h_{ij}(x_j) \right).

Stronger versions of this theorem have also been reported (Lorentz, 1962; Sprecher, 1965). The theorem is interesting because it states that even the most complex multivariate functions can be decomposed into combinations of univariate functions, thereby enabling cross-product interactions to be modeled without introducing cross-product terms in the model. Hecht-Nielsen (1987) has noted that the superposition equation can be interpreted as a three-layer neural network and has suggested using the theorem as a basis for understanding multilayer neural networks. Girosi and Poggio (1989), on the other hand, have criticized this suggestion for several reasons, one being that applying Kolmogorov's theorem would require the inductive learning of nonparametric activation functions. Neural network methods, by contrast, usually assume that the activation functions are given and that the problem is to learn values for the weights that appear in the networks. Although the usual paradigm for training weights can be extended to incorporate the learning of smooth parametric activation functions (i.e., by including their parameters in the partial derivatives that are calculated during training), the incorporation of nonparametric learning methods into the training paradigm was seen as problematic.


Nonparametric learning, on the other hand, is a key strength of decision-tree methods. The learning of nonparametric activation functions thus provides a starting point for combining the global-fitting aspects of neural network methods with the nonparametric learning aspects of decision-tree methods. In the sections that follow, an initial algorithm is presented and subsequently refined that uses decision-tree methods to inductively learn instantiations of the g_i and h_ij functions that appear in Kolmogorov's superposition equation so as to make the equation a good predictor of underlying response surfaces. In this respect, the initial algorithm and its refinement are inspired by, but are not mathematically based upon, Kolmogorov's theorem. The g_i and h_ij functions that are created by the algorithms presented here are quite different from those that are constructed in the various proofs of Kolmogorov's theorem and its variants. The latter are highly nonsmooth fractal functions that in some respects are comparable to hashing functions (Girosi and Poggio, 1989). Moreover, Kolmogorov's theorem requires that the h_ij functions be universal for a given dimension d; that is, the h_ij functions are fixed for each dimension d and only the g_i functions depend on the specific function f. The initial algorithm presented below, on the other hand, heuristically constructs both g_i and h_ij functions that depend on training data. It is important to note that neither universality nor the specific functional forms of the g_i and h_ij functions that appear in the various proofs of Kolmogorov's theorem are necessary in order to satisfy the superposition equation. For example, consider the function f(x,y) = xy. This function can be rewritten in superpositional form as f(x,y) = 0.25(x + y)^2 − 0.25(x − y)^2. In this case, h_11(x) = h_21(x) = x, h_12(y) = y, h_22(y) = −y, h_ij(z) = 0 for i > 2, g_1(z) = 0.25z^2, g_2(z) = −0.25z^2, and g_i(z) = 0 for i > 2.
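The worked example above can be checked numerically. The following sketch (the function name is illustrative, not from the paper) verifies the identity f(x, y) = xy = 0.25(x + y)^2 − 0.25(x − y)^2 at a few points:

```python
# Verify that f(x, y) = xy can be written in superposition form:
# f(x, y) = g1(h11(x) + h12(y)) + g2(h21(x) + h22(y))
# with g1(z) = 0.25 z^2, g2(z) = -0.25 z^2, h11 = h21 = x, h12 = y, h22 = -y.

def superposition(x, y):
    g1 = lambda z: 0.25 * z * z
    g2 = lambda z: -0.25 * z * z
    # h11(x) + h12(y) = x + y;  h21(x) + h22(y) = x - y
    return g1(x + y) + g2(x - y)

for x in (-1.0, 0.3, 2.0):
    for y in (-0.5, 0.0, 1.7):
        assert abs(superposition(x, y) - x * y) < 1e-12
```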
Although these particular g_i and h_ij functions satisfy the superposition equation, they do not satisfy the preamble of the theorem because the above h_ij functions are not universal for all continuous functions f(x,y). Nor do the above g_i and h_ij functions at all resemble the g_i and h_ij functions that are constructed in the proofs of Kolmogorov's theorem and its variants. In general, for any given function f(x_1,...,x_d), there can exist a wide range of g_i and h_ij functions that satisfy the superposition equation without satisfying the preamble of the theorem. Taking the above observations to heart, the initial algorithm presented below likewise ignores the preamble of Kolmogorov's theorem and instead focuses on the mathematical structure of the superposition equation itself. Decision-tree methods and greedy search heuristics are used to construct g_i and h_ij functions based on training data in an attempt to make the superposition equation a good predictor. The approach contrasts with previous work on direct application of the superposition theorem (Neruda, Štědrý, & Drkosová, 2000; Sprecher, 1996, 1997, 2002). One difficulty with direct application

is that the g_i and h_ij functions that need to be constructed are extremely complex and entail very large computational overheads to implement, even when the target function is known (Neruda, Štědrý, & Drkosová, 2000). Another difficulty is that noisy data is highly problematic. The approach presented here avoids both these issues, but it is heuristic in nature. Although there are no mathematical guarantees of obtaining good predictive models using this approach, the initial algorithm and its refinements nevertheless produce very good results in practice. Thus, one of the contributions of this paper is to demonstrate that the mathematical form of the superposition theorem is interesting in and of itself, and can be heuristically exploited to obtain good predictive models. The initial algorithm is based on heuristically interpreting Kolmogorov's superposition equation as a gradient boosting model (Friedman, 2001, 2002) in which the base learner constructs generalized additive models (Hastie & Tibshirani, 1990) whose outputs are then nonlinearly transformed to remove systematic errors in their residuals. To provide the necessary background to motivate this interpretation, gradient boosting and generalized additive modeling are first briefly overviewed in Sections 2 and 3. The initial algorithm is then presented in Section 4. The initial algorithm, however, has very poor convergence properties. Sections 5 and 6 therefore present improvements to the initial algorithm to obtain much faster rates of convergence, yielding a new algorithm called transform regression. Faster convergence is achieved by modifying Friedman's gradient boosting framework so as to introduce a nonlinear form of Gram-Schmidt orthogonalization. The modification requires generalizing the mathematical forms of the models to allow constrained multivariate g_i and h_ij functions to appear in the resulting models in order to implement the orthogonalization method.
The resulting models thus depart from the pure univariate form of Kolmogorov's superposition equation, but the benefit is significantly improved convergence properties. The nonlinear Gram-Schmidt orthogonalization technique is another contribution of this paper, since it can be combined with other gradient boosting algorithms to obtain similar benefits, such as Friedman's (2001, 2002) gradient tree boosting algorithm. Section 7 presents evaluation results that compare the performance of the transform regression algorithm to the underlying decision-tree method that is employed. In the evaluation study, transform regression often produced better predictive models than the underlying decision-tree method when the latter was applied directly. This result is interesting because transform regression uses decision trees in a highly constrained manner. Section 8 discusses some of the details of a parallelized database-embedded implementation of transform


regression that was developed for IBM DB2 Intelligent Miner Modeling. Section 9 presents conclusions and discusses possible directions for future work.

2. Gradient boosting

Gradient boosting (Friedman, 2001, 2002) is a method for making iterative improvements to regression-based predictive models. The method is similar to gradient descent except that, instead of calculating gradient directions in parameter space, a learning algorithm (called the base learner) is used to estimate gradient directions in function space. Whereas with conventional gradient descent each iteration contributes an additive update to the current vector of parameter values, with gradient boosting each iteration contributes an additive update to the current regression model. When the learning objective is to minimize total squared error, gradient boosting is equivalent to Jiang's LSBoost.Reg algorithm (Jiang, 2001, 2002) and to Mallat and Zhang's matching pursuit algorithm (Mallat and Zhang, 1993).

In the special case of squared-error loss, the gradient directions in function space that are estimated by the base learner are models that predict the residuals of the current overall model. Model updating is then accomplished by summing the output of the current model with the output of the model for predicting the residuals, which has the effect of adding correction terms to the current model to improve its accuracy. The resulting gradient boosting algorithm is summarized in Table 1. If the base learner is able to perform a true least-squares fit on the residuals at each iteration, then the multiplying scalar α will always be equal to one. In the general case of an arbitrary loss function, the "residual error" (a.k.a., pseudo-residuals) at each iteration is equal to the negative partial derivative of the loss function with respect to the model output for each data record. In this more general setting, line search is usually needed to optimize the value of α.

When applying gradient boosting, overfitting can be an issue and some means for preventing overfitting must be employed (Friedman, 2001, 2002; Jiang, 2001, 2002). For the algorithms presented in this paper, a portion of the training data is reserved as a holdout set which the base learner employs at each iteration to perform model selection for the residual models. Because it is also possible to overfit the data by adding too many boosting stages (Jiang, 2001, 2002), this same holdout set is also used to prune the number of gradient boosting stages. The algorithms continue to add gradient boosting stages until a point is reached at which either the net improvement obtained on the training data as a result of an iteration falls below a preset threshold, or the prediction error on the holdout set has increased steadily for a preset number of iterations. The current model is then pruned back to the boosting iteration that maximizes predictive accuracy on the holdout set.
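For squared-error loss, the boosting loop just described can be sketched in a few lines (an illustrative rendering of Table 1's procedure, not the paper's implementation; the base-learner interface is an assumption):

```python
import numpy as np

def gradient_boost(X, y, base_learner, max_stages=50, tol=1e-9):
    """Gradient boosting for squared-error loss (cf. Table 1).

    base_learner(X, residuals) must return a callable predict(X).
    Returns the list of (alpha, predict) boosting stages."""
    prediction = np.zeros(len(y))      # current model M is zero everywhere
    stages = []
    prev_loss = float("inf")
    for _ in range(max_stages):
        residuals = y - prediction     # negative gradient for squared error
        predict = base_learner(X, residuals)
        r = predict(X)
        denom = float(np.dot(r, r))
        if denom < 1e-12:              # base learner found no signal
            break
        # optimal scalar alpha for squared error; it equals 1 when the
        # base learner performs a true least-squares fit of the residuals
        alpha = float(np.dot(residuals, r)) / denom
        prediction = prediction + alpha * r
        stages.append((alpha, predict))
        loss = float(np.sum((y - prediction) ** 2))
        if prev_loss - loss < tol:     # M no longer appreciably changes
            break
        prev_loss = loss
    return stages
```

A holdout-based stopping rule, as used in the paper, would replace the simple training-loss threshold above.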

Table 1. The gradient boosting method for minimizing squared error.

Let the current model M be zero everywhere;
Repeat until the current model M does not appreciably change:
    Use the base learner to construct a model R that predicts the residual error of M, ensuring that R does not overfit the data;
    Find a value for scalar α that minimizes the loss (i.e., total squared error) for the model M + αR;
    Update M ← M + αR;

3. Generalized additive models

Generalized additive modeling (Hastie & Tibshirani, 1990) is a method for constructing regression models of the form

\tilde{y} = y_0 + \sum_{j=1}^{d} h_j(x_j).

Hastie and Tibshirani's backfitting algorithm is typically used to perform this modeling task. Backfitting assumes the availability of a learning algorithm, called a smoother, for estimating univariate functions. Traditional examples of smoothers include classical smoothing algorithms that use kernel functions to calculate weighted averages of training data, where the center of the kernel is the point at which the univariate function is to be estimated and the weights are given by the shapes of the kernel functions. In general, however, any learning algorithm can be used as a smoother, including decision-tree methods. An attractive aspect of decision-tree methods is that they can explicitly handle both numerical and categorical input features, whereas classical kernel smoothers require preprocessing to construct numerical encodings of categorical features.

With the backfitting algorithm, the value of y_0 is first set to the mean of the target variable, and a smoother is then applied to successive input variables x_j in round-robin fashion to iteratively (re)estimate the h_j functions until convergence is achieved. The resulting algorithm is summarized in Table 2. Overfitting and loop termination can be handled in a similar fashion as for gradient boosting. The initialization of the h_j functions can be accomplished by setting them to zero everywhere. Alternatively, one can obtain very good initial estimates by applying a greedy one-pass approximation to backfitting that independently applies the smoother to each input x_j and then combines the resulting models using linear regression. This one-pass additive modeling algorithm is summarized in Table 3.

Table 2. The backfitting algorithm.

Set y_0 equal to the mean of the target variable y;
For j = 1,...,d initialize the function h_j(x_j);
Repeat until the functions h_j(x_j) do not appreciably change:
    For j = 1,...,d do the following:
        Use the smoother to construct a model H_j(x_j) that predicts the following target value using only input feature x_j:
            new target = y − y_0 − \sum_{k≠j} h_k(x_k);
        Update h_j(x_j) ← H_j(x_j);

Remarkably, the greedy one-pass algorithm shown in Table 3 can often produce surprisingly good models in practice without additional iterative backfitting. The one-pass algorithm also has the advantage that overfitting can be controlled in the linear regression calculation, either via feature selection or by applying a regularization method. This overfitting control is available in addition to overfitting controls that may be provided by the smoother. Examples of the latter include kernel-width parameters in the case of classical kernel smoothers and tree pruning in the case of decision-tree methods.

Table 3. A greedy one-pass additive modeling algorithm.

Set y_0 equal to the mean of the target variable y;
For j = 1,...,d use the smoother to construct a model H_j(x_j) that predicts the target value (y − y_0) using only input feature x_j;
Calculate linear regression coefficients β_j such that \sum_j β_j H_j(x_j) is a best predictor of (y − y_0);
For j = 1,...,d set h_j(x_j) = β_j H_j(x_j);

4. An initial algorithm

As mentioned earlier, the inspiration for the initial algorithm presented below is based on interpreting Kolmogorov's superposition equation as a gradient boosting model in which the base learner constructs generalized additive models whose outputs are then nonlinearly transformed to remove systematic errors in their residuals. To motivate this interpretation, suppose that we are trying to infer a predictive model for y as a function of inputs x_1,...,x_d given a set of noisy training data {x_1,...,x_d, y}. As a first attempt, we might try constructing a generalized additive model of the form

\tilde{y}_1 = y_{10} + \sum_{j=1}^{d} h_{1j}(x_j) = \sum_{j=1}^{d} \hat{h}_{1j}(x_j),   (1)

where

\hat{h}_{1j}(x_j) = h_{1j}(x_j) + \frac{1}{d} y_{10}.   (2)

This modeling task can be performed by first applying either the backfitting algorithm shown in Table 2 or the greedy one-pass algorithm shown in Table 3, and then distributing the additive constant y_{10} that is obtained equally among the transformed inputs as per Equation 2. Independent of whether backfitting or the greedy one-pass algorithm is applied, residual nonlinearities can still exist in the relationship between the additive model output ỹ_1 and the target value y. To remove such nonlinearities, the same smoother used for additive modeling can again be applied, this time to linearize ỹ_1 with respect to y. The resulting combined model would then have the form

\hat{y}_1 = g_1(\tilde{y}_1) = g_1\left( \sum_{j=1}^{d} \hat{h}_{1j}(x_j) \right).   (3)

To further improve the model, gradient boosting can be applied by using the above two-stage modeling technique as the base learner. The resulting gradient boosting model would then have the form

\tilde{y}_i = y_{i0} + \sum_{j=1}^{d} h_{ij}(x_j) = \sum_{j=1}^{d} \hat{h}_{ij}(x_j),   (4a)

\hat{y}_i = g_i(\tilde{y}_i) = g_i\left( \sum_{j=1}^{d} \hat{h}_{ij}(x_j) \right),   (4b)

\hat{y} = \sum_i \hat{y}_i = \sum_i g_i\left( \sum_{j=1}^{d} \hat{h}_{ij}(x_j) \right),   (4c)

where the ĥ_ij are defined in analogy with Equation 2. Equations 4a and 4b define the stages in the resulting gradient boosting model. Equation 4a defines the generalized additive models ỹ_i that are constructed in each boosting stage, while Equation 4b defines the boosting stage outputs ŷ_i.
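The greedy one-pass procedure of Table 3, which serves as the additive base learner in what follows, can be sketched as below; this is an illustrative rendering in which the smoother is a simple quantile-bin averager rather than the ProbE linear regression trees used in the paper:

```python
import numpy as np

def one_pass_additive_model(X, y, n_bins=8):
    """Greedy one-pass additive modeling (cf. Table 3), using a simple
    piecewise-constant quantile-bin smoother for each feature and a
    least-squares combination of the smoothed features."""
    y0 = float(y.mean())
    target = y - y0
    columns = []
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, n_bins + 1))
        idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1,
                      0, n_bins - 1)
        # H_j(x_j): mean of (y - y0) within each bin of feature j
        means = np.array([target[idx == b].mean() if np.any(idx == b) else 0.0
                          for b in range(n_bins)])
        columns.append(means[idx])
    H = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(H, target, rcond=None)  # combine the H_j
    fitted = y0 + H @ beta                             # h_j = beta_j * H_j
    return fitted
```

Feature selection or regularization in the final regression, as the paper notes, would provide additional overfitting control.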



Figure 1. An example of the modeling behavior of the initial algorithm. (a) The target function. (b) The model output after one gradient boosting stage. (c) After two boosting stages. (d) After ten boosting stages.

Equation 4c defines the output of the overall model, which is the sum of the boosting stage outputs ŷ_i. The resulting algorithm will thus generate predictive models that have the same mathematical form as Kolmogorov's superposition equation. The algorithm itself is summarized in Table 4.

Table 4. The initial algorithm.

Let the current model M be zero everywhere;
Repeat until the current model M does not appreciably change:
    At the i'th iteration, construct an additive model ỹ_i that predicts the residual error of M using the raw features x_1,...,x_d as input, ensuring that ỹ_i does not overfit the data;
    Construct a model g_i that predicts the residual error of M using the output of model ỹ_i as the only input feature, ensuring that g_i does not overfit the data;
    Update M ← M + g_i(ỹ_i);
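The two-stage loop of Table 4 can be sketched as follows (an illustrative rendering; `fit_additive` and `fit_smoother` stand in for the paper's ProbE-based additive modeling and smoothing and are assumptions):

```python
import numpy as np

def initial_algorithm(X, y, fit_additive, fit_smoother, max_stages=10, tol=1e-9):
    """Initial algorithm (cf. Table 4): each boosting stage fits an additive
    model to the residual, then transforms the additive output with a
    univariate smoother g_i before updating the model."""
    M = np.zeros(len(y))
    stages = []
    prev_loss = float("inf")
    for _ in range(max_stages):
        residual = y - M
        additive = fit_additive(X, residual)   # additive model of x_1..x_d
        z = additive(X)
        if float(np.dot(z, z)) < 1e-12:        # no signal found
            break
        g = fit_smoother(z, residual)          # g_i: transform of the output
        M = M + g(z)
        stages.append((additive, g))
        loss = float(np.sum((y - M) ** 2))
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return stages, M
```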

To demonstrate the behavior of the initial algorithm, the ProbE linear regression tree (LRT) algorithm (Natarajan & Pednault, 2002) was used as the smoother in combination with the one-pass greedy additive modeling algorithm shown in Table 3. The LRT algorithm constructs decision trees with multivariate linear regression models in the leaves. To prevent overfitting, LRT employs a combination of tree pruning and stepwise linear regression techniques to select both the size of the resulting tree and the variables that appear in the leaf models. Predictive accuracy on a holdout data set is used as the basis for making this selection. In its use as a smoother, the LRT algorithm was configured to construct splits only on the feature being transformed. In addition, in the case of numerical features, the feature being smoothed was also allowed to appear as a regression variable in the leaf models. The resulting hij functions are thus piecewise linear functions for numerical features xj, and piecewise constant functions for categorical features xj. For the linear regression operation in Table 3, forward stepwise regression was used for feature selection and the same holdout set used by LRT for tree pruning was used for feature pruning to prevent overfitting.


For illustration purposes, synthetic training data was generated using the following target function:

z = f(x, y) = x + y + \sin\left(\frac{\pi x}{2}\right) \sin\left(\frac{\pi y}{2}\right)   (5)

Data was generated by sampling the above function in the region (x, y) ∈ [−1, 1]^2 at grid increments of 0.01. This data was then subsampled at increments of 0.1 to create a test set, with the remaining data randomly divided into a training set and a holdout set. Figure 1 illustrates the above target function and the predictions on the test set after one, two, and ten boosting stages. As can be seen in Figure 1, the algorithm is able to model the cross-product interaction expressed in Equation 5, but the convergence of the algorithm is very slow. Even after ten boosting stages, the root mean squared error is 0.239, which is quite large given that no noise was added to the training data. Nevertheless, Figure 1 does illustrate the appeal of the Kolmogorov superposition theorem, which is its implication that cross-product interactions can be modeled without explicitly introducing cross-product terms into the model. Figure 2 illustrates how the initial algorithm is able to accomplish the same solely by making use of the mathematical structure of the superposition equation. Figures 2a and 2b show scatter plots of the test data as viewed along the x and y input features, respectively. Also plotted in Figures 2a and 2b as solid curves are the feature transformations h_1x(x) and h_1y(y), respectively, that were constructed from the x and y inputs. Figure 2c shows a scatter plot of the test data as viewed along the additive model output ỹ_1, together with the output of the g_1(ỹ_1) function plotted as a solid curve.

As can be seen in Figures 2a and 2b, the first-stage feature transformations h_1x(x) and h_1y(y) extract only the linear terms in the target function. From the point of view of these transformations, the cross-product relationship appears as heteroskedastic noise. However, as shown in Figure 2c, from the point of view of the additive model output ỹ_1, the cross-product relationship appears as residual systematic error together with lower heteroskedastic noise. This residual systematic error is modeled by the g_1(ỹ_1) transformation of the first boosting stage. The resulting output produces the first approximation of the cross-product interaction shown in Figure 1b. As this example illustrates, the nonlinear transformations g_i in Equation 4b (and in Kolmogorov's theorem) have the effect of modeling cross-product interactions without requiring that explicit cross-product terms be introduced into the models.

Figure 2. The test data as seen from various points in the first boosting stage. (a) Test data and h_1x(x) plotted against the x axis. (b) Test data and h_1y(y) plotted against the y axis. (c) Test data and g_1(ỹ_1) plotted against the derived ỹ_1 axis.
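The experimental setup described above can be sketched as follows; note that the sin(πx/2)·sin(πy/2) form of the interaction term in Equation 5 is a reconstruction from a damaged source and should be treated as an assumption:

```python
import numpy as np

# Target function of Equation 5; the sin(pi*x/2)*sin(pi*y/2) form of the
# interaction term is an assumption.
def target(x, y):
    return x + y + np.sin(np.pi * x / 2.0) * np.sin(np.pi * y / 2.0)

# Sample on a 0.01 grid over [-1, 1]^2; integer grid indices avoid
# floating-point membership tests.
gi = np.arange(-100, 101)                  # index k corresponds to k / 100
xx, yy = np.meshgrid(gi, gi)
X = np.column_stack([xx.ravel() / 100.0, yy.ravel() / 100.0])
z = target(X[:, 0], X[:, 1])

# Every 0.1 grid point (both indices multiples of 10) forms the test set;
# the rest would be randomly split into training and holdout sets.
test_mask = ((xx % 10 == 0) & (yy % 10 == 0)).ravel()
test_X, test_z = X[test_mask], z[test_mask]
pool_X, pool_z = X[~test_mask], z[~test_mask]
```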


5. Orthogonalized, feed-forward gradient boosting

A necessary condition that must be satisfied to maximize the rate of convergence of gradient boosting for squared-error loss is that the boosting stage outputs must be mutually orthogonal. To see why, let R_i be the output of the i'th boosting stage. Then the current model M_i obtained after i boosting iterations is given by

M_i = \sum_{j=1}^{i} \alpha_j R_j.

If some of the boosting stage outputs are not mutually orthogonal, then under normal circumstances there will exist coefficients β_j such that the model M̃_i given by

\tilde{M}_i = \sum_{j=1}^{i} \beta_j R_j

will have a strictly smaller squared error on the training data than model M_i. These coefficients can be estimated by performing a linear regression of the boosting stage outputs. Additional boosting iterations would therefore need to be performed on model M_i just to match the squared error of M̃_i. Hence, the rate of convergence will be suboptimal in this case.

To illustrate, suppose the base learner performs a univariate linear regression using one and only one input, always picking the input that yields the smallest squared error. Gradient boosting applied to this base learner produces an iterative algorithm for performing multivariate linear regression that is essentially equivalent to coordinate descent. Suppose further that we have a training set comprising two inputs, x and y, and a target value f(x,y) = x + y with no noise added. If the inputs are orthogonal (i.e., if the dot product between x and y is zero), then only two boosting iterations will be needed to fit the target function to within round-off error. However, if the inputs are not orthogonal, more than two iterations will be needed, and the rate of convergence will decrease as the degree to which they are not orthogonal increases (i.e., as the projection of one input data vector onto the other increases).

In order to improve the rate of convergence, the gradient boosting method needs to be modified so as to increase the degree of orthogonality between boosting stages. One obvious approach would be to apply Gram-Schmidt orthogonalization to the boosting stage outputs. Gram-Schmidt orthogonalization is a technique used in QR decomposition and related linear regression algorithms. Its purpose is to convert simple coordinate descent into an optimal search method for linear least-squares optimization by modifying the coordinate directions of successive regressors. In the case of gradient boosting, the coordinate directions are defined by the boosting stage outputs. However, the usual Gram-Schmidt orthogonalization procedure assumes that all coordinates are specified up front at the start of the procedure. Gradient boosting, on the other hand, constructs coordinates dynamically in a stagewise fashion, so an incremental orthogonalization procedure is needed.

Although not as computationally efficient as the traditional Gram-Schmidt procedure, incremental orthogonalization can be readily accomplished simply by replacing the αR term in the gradient boosting algorithm shown in Table 1 with a linear regression of the current boosting stage output R and all previous boosting stage outputs. Adding this step to the procedure produces the orthogonalized gradient boosting method shown in Table 5. In the case of squared-error loss, the line search step in Table 5 to find an optimal scalar α is actually not needed because this scalar will equal one by construction. However, the line search is included in Table 5 to indicate how orthogonalized gradient boosting generalizes to arbitrary loss functions. When orthogonalized gradient boosting is applied to the univariate linear regression base learner described above, an optimum rate of convergence is achieved and the resulting overall algorithm is closely related to the QR decomposition algorithm for linear regression.

Table 5. Orthogonalized gradient boosting for squared-error loss.

Let the current model M be zero everywhere;
Repeat until the current model M does not appreciably change:
    At the i'th iteration, use the base learner to construct a model R_i that predicts the residual error of M, ensuring that R_i does not overfit the data;
    Use least-squares linear regression to find coefficients β_k, k ≤ i, such that \sum_k β_k R_k best fits the residual error of M;
    Find a value for scalar α that minimizes the loss function (i.e., total squared error) for the model M + α \sum_k β_k R_k;
    Update M ← M + α \sum_k β_k R_k;

Although the orthogonalized gradient boosting method in Table 5 is expedient, it does not address the underlying source of non-orthogonality between boosting stage outputs, which is the base learner itself. A more refined solution would be to strengthen the base learner so that it produces models whose outputs are already orthogonal with respect to previous boosting stages. To accomplish that, the approach presented here involves first modifying the gradient boosting method shown in Table 1 so that, at each iteration, all previous boosting stage outputs are made available to the base learner as additional inputs. This modification yields the feed-forward gradient boosting method shown in Table 6.

Table 6. Feed-forward gradient boosting for squared-error loss.

Let the current model M be zero everywhere;
Repeat until the current model M does not appreciably change:
    At the i'th iteration, use the base learner to construct a model R_i that predicts the residual error of M using all previous boosting stage outputs R_1,...,R_{i-1} as additional inputs, while ensuring that R_i does not overfit the data;
    Find a value for scalar α that minimizes the loss (i.e., total squared error) for the model M + αR_i;
    Update M ← M + αR_i;

The primary motivation for feed-forward gradient boosting is to enable the base learner to perform an implicit Gram-Schmidt orthogonalization instead of the explicit orthogonalization performed in Table 5. In particular, by making previous boosting stage outputs available to the base learner, it might be possible to modify the base learner so that it achieves a nonlinear form of Gram-Schmidt orthogonalization, in contrast to the linear orthogonalization step in Table 5. In the next section, it will be shown how such a modification can in fact be made to the additive-modeling base learner used in the initial algorithm.

Table 7. Orthogonalized feed-forward gradient boosting for squared-error loss.

Let the current model M be zero everywhere;
Repeat until the current model M does not appreciably change:
    At the i'th iteration, use the base learner to construct a model R_i that predicts the residual error of M using the previous boosting stage outputs R_1,...,R_{i-1} as additional inputs, while ensuring that R_i does not overfit the data;
    Use least-squares linear regression to find coefficients β_k, k ≤ i, such that \sum_k β_k R_k best fits the residual error of M;
    Find a value for scalar α that minimizes the loss function (i.e., total squared error) for the model M + α \sum_k β_k R_k;
    Update M ← M + α \sum_k β_k R_k;
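The orthogonalized feed-forward method of Table 7 can be sketched as follows (illustrative; the base-learner interface is an assumption):

```python
import numpy as np

def orthogonalized_ff_boost(X, y, base_learner, max_stages=20, tol=1e-9):
    """Orthogonalized feed-forward gradient boosting for squared-error loss
    (cf. Table 7): previous stage outputs are given to the base learner as
    additional inputs, and each update is a least-squares combination of
    all stage outputs produced so far."""
    M = np.zeros(len(y))
    outputs = []                              # stage outputs R_1, ..., R_i
    prev_loss = float("inf")
    for _ in range(max_stages):
        residual = y - M
        features = np.column_stack([X] + outputs) if outputs else X
        predict = base_learner(features, residual)
        r = predict(features)
        if float(np.dot(r, r)) < 1e-12:
            break
        outputs.append(r)
        R = np.column_stack(outputs)
        beta, *_ = np.linalg.lstsq(R, residual, rcond=None)
        M = M + R @ beta       # alpha = 1 by construction for squared error
        loss = float(np.sum((y - M) ** 2))
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return M
```

With the single-input base learner from the coordinate-descent illustration, this variant fits f(x, y) = x + y in two stages even when the inputs are not orthogonal.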

An additional but no less important effect of feed-forward gradient boosting is that it further strengthens weak base learners by expanding their hypothesis spaces through function composition. If in the first iteration the base learner considers models of the form M_1(x_1,...,x_d), then in the second iteration it will consider models of the form M_2(x_1,...,x_d, M_1(x_1,...,x_d)). Unless the base learner's hypothesis space is closed under this type of function composition, feed-forward gradient boosting will have the side effect of expanding the base learner's hypothesis space without modifying its mode of operation. The strengthening of the base learner produced by this expansion effect can potentially improve the accuracy of the models that are constructed.

Although feed-forward gradient boosting is motivated by orthogonality considerations, the method steps defined in Table 6 are not sufficient to guarantee orthogonality for arbitrary base learners, since some base learners might not be able to make full use of the outputs of previous boosting stages in order to achieve orthogonality. In such cases, the orthogonalized feed-forward gradient boosting method shown in Table 7 can be employed.

6. The transform regression algorithm

In the case of the initial algorithm presented in Section 4, the base learner itself can be modified to take full advantage of the outputs of previous boosting stages. In particular, two modifications are made to the initial algorithm in order to arrive at the transform regression algorithm. Because feed-forward gradient boosting permits boosting stage outputs to be used as input features to subsequent stages, the first modification is to convert the g_i functions in Equation 4 into h_ij functions by eliminating Equation 4b and by using the outputs of the additive models in Equation 4a as input features to all subsequent gradient boosting stages. This modification is intended mainly to simplify the mathematical form of the resulting models by exploiting the fact that feed-forward gradient boosting explicitly makes boosting stage outputs available to subsequent stages, which enables the h_ij functions to perform dual roles: transform the input features and transform the additive model outputs.

The second modification is to introduce multivariate h_ij functions by further allowing the outputs of the additive models in Equation 4a to appear as additional inputs to the h_ij functions in all subsequent stages. This modification is intended to push the orthogonalization of model outputs all the way down to the construction of the h_ij feature transformation functions. As discussed in Section 4, the ProbE linear regression tree (LRT) algorithm (Natarajan & Pednault, 2002) was used in the initial algorithm to construct the univariate g_i and h_ij functions that appear in Equation 4. The LRT algorithm, however, is capable of constructing multivariate linear


regression models in the leaves of trees, and not just univariate linear regression models as was needed for the initial algorithm. By allowing previous boosting stage outputs to appear as regressors in the leaves of these trees, the output of each leaf model will then be orthogonal to those previous boosting stage outputs when the leaf conditions are satisfied. Hence, the overall nonlinear transformations defined by the trees will be orthogonal to the previous boosting stage outputs, and any linear combination of the trees will likewise be orthogonal.

With the above changes, the mathematical form of the resulting transform regression models is given by the following system of equations:

$$\hat{y}_1 = \sum_{j=1}^{d} h_{1j}(x_j) \qquad (6a)$$

$$\hat{y}_i = \sum_{j=1}^{d} h_{ij}\bigl(x_j \mid \hat{y}_1,\ldots,\hat{y}_{i-1}\bigr) + \sum_{k=d+1}^{d+i-1} h_{ik}\bigl(\hat{y}_{k-d} \mid \hat{y}_1,\ldots,\hat{y}_{i-1}\bigr), \quad i > 1 \qquad (6b)$$

$$\hat{y} = \sum_{i} \hat{y}_i \qquad (6c)$$

where the notation $h_{ij}(x_j \mid \hat{y}_1,\ldots,\hat{y}_{i-1})$ is used to indicate that function $h_{ij}$ is meant to be a nonlinear transformation of $x_j$, and that this transformation is allowed to vary as a function of $\hat{y}_1,\ldots,\hat{y}_{i-1}$. Likewise for the $h_{ik}(\hat{y}_{k-d} \mid \hat{y}_1,\ldots,\hat{y}_{i-1})$ functions that appear in Equation 6b. Note that the latter are the counterparts to the gi functions in Equation 4b. In concrete terms, when applying the ProbE LRT algorithm to construct $h_{ij}(x_j \mid \hat{y}_1,\ldots,\hat{y}_{i-1})$ and $h_{ik}(\hat{y}_{k-d} \mid \hat{y}_1,\ldots,\hat{y}_{i-1})$, the LRT algorithm is constrained to split only on the features being transformed (i.e., $x_j$ and $\hat{y}_{k-d}$, respectively), with all other inputs (i.e., $\hat{y}_1,\ldots,\hat{y}_{i-1}$) allowed to appear as regressors in the linear regression models in the leaves of the resulting trees. Of course, the features being transformed can likewise appear as regressors in the leaf models if they are numeric. Equation 6a corresponds to the first boosting stage, while Equation 6b corresponds to all subsequent stages. Equation 6c defines the output of the overall model. The resulting algorithm is summarized in Table 8.

Table 8. The transform regression algorithm.

    Let the current model M be zero everywhere;
    Repeat until the current model M does not appreciably change:
        At the i'th iteration, construct an additive model ŷi that predicts the residual error of M, using both the raw features x1,...,xd and the outputs ŷ1,...,ŷi-1 of all previous boosting stages as potential input features to ŷi, allowing the feature transformations to vary as functions of the previous boosting stage outputs ŷ1,...,ŷi-1, and ensuring that ŷi does not overfit the data;
        Update M ← M + ŷi;

Although Equation 6 departs from the mathematical form of Kolmogorov's superposition equation, the above modifications dramatically improve the rate of convergence of the resulting algorithm. Figure 3 illustrates the increased rate of convergence of the transform regression algorithm compared to the initial algorithm when transform regression is applied to the same data as in Figures 1 and 2. In this experiment, the ProbE linear regression tree (LRT) algorithm was again used, this time exploiting its ability to construct multivariate regression models in the leaves of trees. As with the initial algorithm, one-pass greedy additive modeling was used with stepwise linear regression, with the holdout set used to estimate generalization error in order to avoid overfitting.

As shown in Figure 3a, because the gi functions have been removed, the first stage of transform regression extracts the two linear terms of the target function, but not the cross-product term. As can be seen in Figure 3d, the first boosting stage therefore has a higher approximation error than the first boosting stage of the initial algorithm. However, for all subsequent boosting stages, transform regression outperforms the initial algorithm, as can be seen in Figures 3b-d. As this example demonstrates, by modifying the gradient boosting algorithm and the base learner to achieve orthogonality of gradient boosting stage outputs, a dramatic increase in the rate of convergence can be obtained.

7. Experimental evaluation

Table 9 shows evaluation results that were obtained on eight data sets that were used to compare the performance of the transform regression algorithm to the underlying LRT algorithm. Also shown are results for the first gradient boosting stage of transform regression, and for the stepwise linear regression algorithm that is used both in the leaves of linear regression trees and in the greedy one-pass additive modeling method. The first four data sets are available from the UCI Machine Learning Repository and the UCI KDD Archive. The last four are internal IBM data sets.
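To make the stage-wise structure of Table 8 concrete, the following sketch implements the stage loop with a deliberately crude stand-in for the ProbE LRT transforms: each h function is approximated by a piecewise-constant fit on quantile bins, and plain least squares stands in for stepwise regression. It omits the conditioning of each transform on previous stage outputs (the multivariate leaf models), so it illustrates only the overall flow, with the raw features and all previous stage outputs entering each stage.

```python
import numpy as np

def piecewise_transform(f, residual, n_bins=4):
    """Crude stand-in for an LRT feature transform: a piecewise-constant
    map of feature f fitted to the residual (mean residual per bin)."""
    edges = np.quantile(f, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, f)
    out = np.zeros_like(residual)
    for b in np.unique(bins):
        out[bins == b] = residual[bins == b].mean()
    return out

def transform_regression(X, y, max_stages=5):
    """Table 8 loop: each stage i fits an additive model to the residual
    over the raw features AND all previous stage outputs y_hat_1..y_hat_{i-1}."""
    model_out = np.zeros(len(y))
    prev_stage_outputs = []
    for i in range(max_stages):
        residual = y - model_out
        feats = list(X.T) + prev_stage_outputs   # x_1..x_d, y_hat_1..y_hat_{i-1}
        T = np.column_stack([piecewise_transform(f, residual) for f in feats])
        # Stand-in for stepwise regression: plain least squares over transforms.
        w, *_ = np.linalg.lstsq(T, residual, rcond=None)
        stage_out = T @ w
        if np.linalg.norm(stage_out) < 1e-9:
            break
        model_out = model_out + stage_out
        prev_stage_outputs.append(stage_out)
    return model_out
```

The function names and the quantile-binning transform are mine; a faithful implementation would fit a linear regression tree per feature with holdout-based overfitting control, as described in the text.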


[Figure 3 appears here: panels (a)-(c) show 3-D surface plots of the model output over the unit square after one, two, and three gradient boosting stages; panel (d) plots RMS error against boosting stage (1-10) for transform regression versus the initial algorithm.]

Figure 3. An example of the modeling behavior of the transform regression algorithm. (a) Model output after one gradient boosting stage. (b) After two stages. (c) After three stages. (d) RMS errors of successive gradient boosting stages.

Because all data sets have nonnegative target values, and because all but one (i.e., KDDCup98 TargetD) have 0/1 target values, comparisons were made based on Gini coefficients of cumulative gains charts (Hand, 1997) estimated on holdout test sets. Cumulative gains charts (a.k.a. lift curves) are closely related to ROC curves, except that gains charts have the benefit of being applicable to continuous nonnegative numeric target values in addition to 0/1 categorical values. Gini coefficients are normalized areas under cumulative gains charts, where the normalization produces a value of zero for models that are no better than random guessing, and a value of one for perfect predictors. Gini coefficients are thus closely related to AUC measurements (i.e., the areas under ROC curves). Note, however, that models that are no better than random guessing have an AUC of 0.5 but a Gini coefficient of zero.

Table 9. Gini coefficients for different data sets and algorithms. For each data set, the best coefficient is marked with an asterisk.

    DATA SET    TRANS REG   FIRST BOOST STAGE   LIN REG TREES   STEP LIN REG
    ADULT       0.655*      0.559               0.566           0.429
    COIL        0.431*      0.382               0.311           0.373
    KDD-98 B    0.217*      0.216               0.160           0.164
    KDD-98 D    0.157*      0.140               0.102           0.000
    A           0.536       0.468               0.541*          0.162
    D           0.536       0.543*              0.447           0.409
    M           0.690*      0.682               0.638           0.380
    R           0.508*      0.481               0.491           0.435

As can be seen in Table 9, transform regression produces better models than the underlying LRT algorithm on all data sets but one, and for the one exception the LRT model is only slightly better. Remarkably, the first gradient boosting stage also produces better models than the LRT algorithm on a majority of the data sets. In one case, the first stage model is also better than the overall transform regression model, which indicates an overfitting problem with the prototype implementation used for these experiments.
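The Gini coefficient described above can be computed directly from a cumulative gains chart. The following sketch matches that description in that a random ranking scores zero and a perfect ranking scores one; the function name and the exact trapezoidal convention are mine.

```python
import numpy as np

def gini_from_gains(y_true, y_score):
    """Normalized area under the cumulative gains chart for nonnegative
    targets: 0 for a random ranking, 1 for a perfect ranking."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score))   # best-scored cases first
    n = len(y_true)
    x = np.arange(n + 1) / n                   # fraction of cases targeted

    def area(curve):                           # trapezoidal area under curve
        return float(np.sum((curve[1:] + curve[:-1]) * np.diff(x) / 2.0))

    total = y_true.sum()
    gains = np.concatenate([[0.0], np.cumsum(y_true[order])]) / total
    perfect = np.concatenate([[0.0], np.cumsum(np.sort(y_true)[::-1])]) / total
    a_model, a_random, a_perfect = area(gains), area(x), area(perfect)
    # Normalize so that random guessing -> 0 and a perfect ranking -> 1.
    return (a_model - a_random) / (a_perfect - a_random)
```

For 0/1 targets this quantity is a linear rescaling of the AUC (Gini = 2*AUC - 1 up to ties), which is the relationship noted in the text.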


8. Computational considerations

In addition to its other properties, transform regression can be implemented as a computationally efficient, parallelizable algorithm. Such an implementation is achieved by combining the greedy one-pass additive modeling algorithm shown in Table 3 with the ProbE linear regression tree (LRT) algorithm (Natarajan & Pednault, 2002). As per Table 3, each boosting stage is calculated in two phases. The first phase constructs initial estimates of the feature transformation functions hij and hik that appear in Equations 6a and 6b. The second phase performs a stepwise linear regression on these initial feature transformations in order to select the most predictive transformations and to estimate their scaling coefficients.
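The single-pass, partition-parallel computations described below all rest on one primitive: sufficient statistics for linear regression (row counts, mean vectors, and scatter/covariance matrices) computed on disjoint data partitions can be merged without revisiting the data. A minimal sketch of that merge (the pairwise update formula is standard; the function names are illustrative):

```python
import numpy as np

def suff_stats(X):
    """Sufficient statistics for linear regression on one data partition:
    row count, mean vector, and scatter (sum of squared deviations) matrix."""
    n = X.shape[0]
    mean = X.mean(axis=0)
    D = X - mean
    return n, mean, D.T @ D

def merge_stats(s1, s2):
    """Merge statistics from two disjoint partitions without revisiting the
    data -- the same kind of merge serves both parallel data scans and the
    bottom-up pairwise merging of tree leaves."""
    n1, m1, S1 = s1
    n2, m2, S2 = s2
    n = n1 + n2
    delta = m2 - m1
    mean = m1 + delta * (n2 / n)
    # Combined scatter: partition scatters plus a between-partition term.
    S = S1 + S2 + np.outer(delta, delta) * (n1 * n2 / n)
    return n, mean, S
```

Because the merge is associative, partitions can be processed in parallel and combined in any order, which is what makes the data scans embarrassingly parallel.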

The ProbE LRT technology enables the computations for the first phase to be performed using only a single pass over the data. It also enables the computations to be data-partition parallelized for scalability. The LRT algorithm incorporates a generalized version of the bottom-up merging technique used in the CHAID algorithm (Biggs, de Ville, & Suen, 1991; Kass, 1980). Accordingly, multiway splits are first constructed for each input feature. Next, the data is scanned to estimate sufficient statistics for the leaf models in each multiway split. Finally, leaf nodes and their sufficient statistics are merged in a bottom-up pairwise fashion to produce trees for each feature without further accessing the data. For categorical features, the category values define the multiway splits. For numerical features, the feature values are discretized into intervals and these intervals define the multiway splits. Although the CHAID method considers only constant leaf models, the approach can be generalized to include stepwise linear regression models in the leaves (Natarajan & Pednault, 2002). In the case of linear regression, the sufficient statistics are mean vectors and covariance matrices.

By calculating sufficient statistics simultaneously for both training data and holdout data, the tree building and tree pruning steps can be performed using only these sufficient statistics without any further data access. Linear regression tree estimates of the hij and hik feature transformation functions can therefore be calculated using only a single pass over both the training and holdout data at each iteration. In addition, because the technique of merging sufficient statistics can be applied to any disjoint data partitions, the same merging method used during tree building can be used to merge sufficient statistics that are calculated in parallel on disjoint data partitions (Dorneich et al., 2006). This merging capability enables data scans to be readily parallelized.

In the first phase, the stepwise linear regression models that appear in the leaves of the feature transformation trees are relatively small: at each iteration, the maximum number of regressors is equal to the iteration number. In the second phase, no trees are constructed; instead, a single large stepwise linear regression is performed. In this case, the number of regressors is equal to the number of transformation trees (i.e., the number of input features plus the iteration index minus one). As with the first phase, the mean vectors and covariance matrices that define the sufficient statistics for the linear regression can be calculated using only a single pass over the training and holdout data. The sufficient statistics can likewise be calculated in parallel on disjoint data partitions, with the results then merged using the same merging technique used for tree building.

The above implementation techniques produce a scalable and efficient algorithm. They have been incorporated into a parallelized version of transform regression that is now available in IBM DB2 Intelligent Miner Modeling, IBM's database-embedded data mining product (Dorneich et al., 2006).

9. Conclusions

Although the experimental results presented above are by no means an exhaustive evaluation, the consistency of the results clearly demonstrates the benefits of the global function-fitting approach of transform regression compared to the local fitting approach of the underlying linear regression tree (LRT) algorithm that is employed. Transform regression uses the LRT algorithm to construct a series of global functions that are then combined using linear regression. Although this use of LRT is highly constrained, in many cases it enables better models to be constructed than with the pure local fitting of LRT. In this respect, transform regression successfully combines the global-fitting aspects of learning methods such as neural networks with the nonparametric local-fitting aspects of decision trees.

Transform regression is also computationally efficient. Only two passes over the data are required to construct each boosting stage: one pass to build linear regression trees for all input features to a boosting stage, and another pass to perform the stepwise linear regression that combines the outputs of the resulting trees to form an additive model. The amount of computation required per boosting stage is therefore between one and two times the amount of computation needed by the LRT algorithm to build a single level of a conventional linear regression tree when the LRT algorithm is applied outside the transform regression framework.

Another aspect of transform regression is that it demonstrates how Friedman's gradient boosting framework can be enhanced to strengthen the base learner and improve the rate of convergence. One enhancement is to use the outputs of boosting stages as first-class input features to subsequent stages. This modification has the effect of expanding the hypothesis space through function composition of boosting stage models. Another enhancement is to modify the base


learner so that it produces models whose outputs are linearly orthogonal to all previous boosting stage outputs. This orthogonality property improves the efficiency of the gradient descent search performed by the boosting algorithm, thereby increasing the rate of convergence. In the case of transform regression, this second modification involved using boosting stage outputs as additional multivariate inputs to the feature transformation functions hij and hik. The same approach can be used in combination with Friedman's tree boosting algorithm by replacing his conventional tree algorithm with the LRT algorithm, and it should likewise be possible to extend the approach presented here to other tree-based boosting techniques.

Transform regression, however, is still a greedy hill-climbing algorithm. As such, it can get caught in local minima and at saddle points. In order to avoid local minima and saddle points entirely, additional research is needed to further improve the transform regression algorithm. Several authors (Kůrková, 1991, 1992; Neruda, Štědrý, & Drkošová, 2000; Sprecher, 1996, 1997, 2002) have been investigating ways of overcoming the computational problems of directly applying Kolmogorov's theorem. Given the strength of the results obtained here using the form of the superposition equation alone, research aimed at creating a combined approach could potentially be quite fruitful.

References

Biggs, D., de Ville, B., & Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics 18(1):49-62.

Dorneich, A., Natarajan, R., Pednault, E., & Tipu, F. (2006). Embedded predictive modeling in a parallel relational database. To appear in Proceedings of the 21st ACM Symposium on Applied Computing, April 2006, Dijon, France.

Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29(5):1189-1232.

Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4):367-378.

Girosi, F. & Poggio, T. (1989). Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation 1(4):465-469.

Hand, D.J. (1997). Construction and Assessment of Classification Rules. New York: John Wiley and Sons.

Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. New York: Chapman and Hall.

Hecht-Nielsen, R. (1987). Kolmogorov's mapping neural network existence theorem. Proc. IEEE International Conference on Neural Networks, Vol. 3, 11-14.

Jiang, W. (2001). Is regularization unnecessary for boosting? Proc. 8th Intl. Workshop on AI and Statistics, 57-64. San Mateo, California: Morgan Kaufmann.

Jiang, W. (2002). On weak base hypotheses and their implications for boosting regression and classification. Annals of Statistics 30(1):51-73.

Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics 29(2):119-127.

Kolmogorov, A.N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR 114(5):953-956. Translated in American Mathematical Society Translations, Series 2, 28:55-59 (1963).

Kůrková, V. (1991). Kolmogorov's theorem is relevant. Neural Computation 3(4):617-622.

Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks 5(3):501-506.

Lorentz, G.G. (1962). Metric entropy, widths, and superposition of functions. American Mathematical Monthly 69:469-485.

Mallat, S.G. & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41(12):3397-3415.

Natarajan, R. & Pednault, E.P.D. (2002). Segmented regression estimators for massive data sets. Proc. Second SIAM International Conference on Data Mining, available online at www.siam.org.

Neruda, R., Štědrý, A., & Drkošová, J. (2000). Towards feasible learning algorithm based on Kolmogorov theorem. Proc. International Conference on Artificial Intelligence, Vol. II, 915-920. CSREA Press.

Sprecher, D.A. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society 115(3):340-355.

Sprecher, D.A. (1996). A numerical implementation of Kolmogorov's superpositions. Neural Networks 9(5):765-772.

Sprecher, D.A. (1997). A numerical implementation of Kolmogorov's superpositions II. Neural Networks 10(3):447-457.

Sprecher, D.A. (2002). Space-filling curves and Kolmogorov superposition-based neural networks. Neural Networks 15(1):57-67.

