Научная статья на тему 'New approaches to regression in financial Mathematics by additive models'

New approaches to regression in financial Mathematics by additive models Текст научной статьи по специальности «Математика»

CC BY
245
63
i Надоели баннеры? Вы всегда можете отключить рекламу.
Область наук

Аннотация научной статьи по математике, автор научной работы — Taylan P., Weber G. -w

Additive models belong to the techniques of modern statistical learning; they are applicable in many areas of prediction such as financial mathematics, computational biology, medicine, chemistry and environmental protection. They are fitted through backfitting algorithm using the partials residuals proposed by Friedman and Stuetzle (1981) [9]. In this paper, we firstly give an introduction and review. Then, we present a mathematical modeling by splines based on a new clustering approach for the input data x, their density, and the variation of the output data y. We contribute to regression with additive models by bounding (penalizing) second order terms (curvature) of the splines, leading to a more robust approximation. Now, we propose a refining modification and investigation of the backfitting algorithm previously applied to additive models. By using the language of optimization theory, especially, conic quadratic programming, we initiate future research on and practical applications of solution methods with mathematical programming

i Надоели баннеры? Вы всегда можете отключить рекламу.
iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.
i Надоели баннеры? Вы всегда можете отключить рекламу.

Текст научной работы на тему «New approaches to regression in financial Mathematics by additive models»

Вычислительные технологии

Том 12, № 2, 2007

NEW APPROACHES TO REGRESSION IN FINANCIAL MATHEMATICS BY ADDITIVE MODELS

P. Taylan

Dicle University, Department of Mathematics, Diyarbakir, Turkey Middle East Technical University, Institute of Applied Mathematics,

Ankara, Turkey, e-mail: [email protected]

G.-W. Weber

Middle East Technical University, Institute of Applied Mathematics,

Ankara, Turkey, e-mail: [email protected]

Аддитивные модели принадлежат к технике современного статистического познания, они применяются для прогнозирония во многих областях, таких как финансовая математика, вычислительная биология, медицина, химия и защита окружающей среды. Эти модели используются посредством алгоритма обратного фит-тинга, основанного на методе частичных остатков, предложенном Friedman and Stuetzle (1981). В этой статье мы сначала даем введение в проблему и обзор. Затем мы представляем моделирование сплайнами, основанное на новом кластерном подходе для входных данных, плотности этих данных и изменении выходных данных. Наш вклад в метод регрессии с аддитивными моделями состоит в ограничении членов, отвечающих за кривизну сплайна, что приводит к более надежной аппроксимации. Мы предлагаем усовершенствованную модификацию и исследование алгоритма обратного фиттинга применительно к аддитивным моделям. Используя язык теории оптимизации, в частности, метод конического квадратичного программирования, мы инициируем дальнейшие исследования и их практические приложения для программирования.

1. Introduction

1.1. Learning and Models

In the last decades, learning from data has become very important in every field of science, economy and technology, for problems concerning the public and the private life as well. Modern learning challenges can for example be found in the fields of computational biology and medicine, and in the financial sector. Learning enables for doing estimation and prediction.

© Институт вычислительных технологий Сибирского отделения Российской академии наук, 2007.

There are regression, mainly based on the idea of least squares or maximum likelihood estimation, and classification. In statistical learning, we are beginning with deterministic models and, then, we turn to the more general case of stochastic models where uncertainties, noise or measurement errors are taken into account. For a closer information we refer to the book Hastie, Tibshirani, Friedman [12]. In classical models, the approach to explain the recorded data y consists of one unknown function only; the introduction of additive models (Buja, Hastie, Tibshirani 1989 [5]) allowed an "ansatz" with a sum of functions which have separated input variables. In our paper, we figure out clusters of input data points x (or entire data points (x,y)), and assign an own function that additively contributes to the understanding and learning from the measured data. These functions over domains (e. g., intervals) depending on the cluster knots are mostly assumed to be splines. We will introduce an index useful for deciding about the spline degrees by density and variation properties of the corresponding data in x and y components, respectively. In a further step of refinement, aspects of stability and complexity of the problem are implied by keeping the curvatures of the model functions under some chosen bounds. The corresponding constrained least squares problem can, e.g., be treated as a penalized unconstrained minimization problem. In this paper, we specify (modify) the backfitting algorithm which contains curvature term and apply for additive models.

This paper contributes to both the m-dimensional case of input data separated by the model functions and, as our new alternative, to 1-dimensional input data clustered. Dimensional generalizations of the second interpretation and a combination of both interpretations are possible and indicated. Applicability for data classification is noted. We point out advantages and disadvantages of the concept of backfitting algorithm. By all of this, we initiate future research with a strong employing of optimization theory.

1.2. A Motivation of Regression

This paper has been motivated by the approximation of finanical data points (x,y), e.g., coming from the stock market. Here, x represents the input constellation, while y stands for the observed data. The discount function, denoted by 5(x), is the current price of a risk free, zero coupon bond paying unit of money at time x. We use y(x) to denote the zero-coupon yield curve and to f (x) to denote the instantaneous forward rate curve. These are related to the discount function by

The term interest rate curve can be used to refer to any one of these three related curves.

In a world with complete markets and no taxes or transaction, absence of arbitrage implies that the price of any coupon bond can be computed from an interest rate curve. In particular, if the principal and interest payment of a bond is Cj units of money at time xj (j = 1, ...,m), then the pricing equation for the bond is

#(x) = exp (—xy(x)) = exp

(1.1)

(1.2)

The interest rate curve can be estimated if given a set of bond prices. For this reason, let (Bi)i=1 . comprise the bonds, X1 < X2 < ... < Xm be the set of dates at which principal and interest payments occur, let cj be the principal and interest payment of the ith bond on date Xj, and Pi be the observed price of the ith bond. The pricing equation is

Pi = P + £i, (1.3)

where Pi is defined by Pi = cj 6 (Xj) (Waggoner 1997 [22]). The curves of discount

j=1

6(x), yield y(x) and forward rate f (x) can be extracted via linear regression, regression with splines, smoothing splines, etc., using prices of coupon bond. For example, assuming P := (Pi,...,Pn)T and C := (cj)i

= 1,...,N;j=1,...,m to be known, denoting the vector of errors or

residuals (i.e., noise, inaccuracies and data uncertainties) by £ := (e1 , ...,£N)T and writing fl := 8 (X) = (6(X1),..., 6(Xm))T, then the

pricing equation looks as follows:

P = Cfl + £. (1.4)

Thus, the equation (1.4) can be seen as linear model with the unknown parameter vector (6(X1),...,6(Xm))T = fl. If

we use linear regression methods or maximum likelihood estimation and, in many important cases, just least squares estimation, then we can extract 8 (X). For introductory and closer information about these methods from the viewpoints of statistical learning or the theory of inverse problems, we refer to the books of Hastie, Tibshirani, Friedman [12] and Aster, Borchers, Thurber [2], respectively.

1.3. Regression 1.3.1. Linear Regression

Linear regression models the relation between two variables by fitting a linear model to observed data. One variable is considered to be an input (explanatory) variable x, and the other is considered to be an output (dependent) variable y. Before attempting to do this fitting, the modeller firstly decides whether at all there is a relationship between x and y. This does not necessarily imply that one variable causes the other, but that there is some significant association between the two variables. A scatterplot can be a helpful tool in determining the strength of the relationship between two variables [7].

Provided an input vector X = (X1,..., Xm)T of (random) variables and an output variable Y, our linear regression model has the form

m

Y = E(Y| X1,..., Xm) + £ = f (X) + £ = 0o + £ Xj+ £. (1.5)

j=1

Here, ^j are unknown parameters or coefficients, the error £ is a Gaussian random variable with expectation 0 and variance a2, in short: £ ~ N(0, a2), and the variables Xj can be from different sources. We denote the estimation of f by f. The most popular estimation method is least squares approximation which determines the coefficient vector fl = (^0, ..., ^m)T to minimize the residual sum of squares via observation values (yi,xi),

N

RSS (0) :=£ (y - xf 0) 2

i=1

(1.6)

or, in matrix notation,

RSS (0) = (y - X0)T (y - X0). (1.7)

Here, X is the N x (m + 1) matrix with each row being an input vector (with entries 1 in the first column), and y is the N vector of outputs in the training set. Equation (1.7) is a quadratic function in m + 1 unknown parameters. If N > m +1 and X has full rank, then our solution vector ft which minimizes RSS is 0 = (XTX) XTy. The fitted values at the training inputs are y = X/3 = X (XTX) 1 XTy, where y^ = f(x^) = xT/1

1.3.2. Regression with Splines

In the above regression model, sometimes

f (X) = E (Y | Xi,...,Xm) (1.8)

can be nonlinear and nonadditive. Since, however, a linear model is easy to interpret, we want to represent f(X) by a linear model. Therefore, an approximation by a first-order Taylor approximation to f (X) can be used and sometimes even needs to be done. In fact, if N is small or m large, a linear model might be all we are able to use for data fitting without overfitting.

Regression with splines is a very popular method as for moving beyond linearity [12]. Here, we expand or replace the vector of inputs X with additional variables, which are transformations of X and, then, we use linear models in this new space of derived input features. Let X vector of inputs and hl : Rm ^ R be the jth transformation of X or basis function (/ = 1,..., M). Then, f (X) is modelled by

M

f (X ) = £ 0i hi (X), (1.9)

i=i

a linear basis expansion in X. Herewith, the model has become linear in these new variables and the fitting proceeds as in the case of a linear model. In fact, the estimation of 0 is

0 = (HT(x)H(x))-1 HT (x) Y, (1.10)

where H(x) = (hl(xi))i=1)...)N; is the matrix of basis functions evaluated at the input data.

i=1,...,M

Hence, f (X) becomes estimated by f(X) = hT(X)0. For the special case hl(X) = Xl (l = 1,...,M) the linear model is recovered. Generally, in one dimension (m = 1), an order M spline with knots = 1,...,K) is piecewise polynomial of degree M — 1, and has continuous derivatives up to order M — 2. A cubic spline has M = 4. Any piecewise constant function is an order 1 spline, while the continuous piecewise linear function is an order 2 spline. Likewise the general form for the truncated-power basis set would be hl(X) = Xl-1 (l = 1,..., M) and hM+T(X) = (X — )M-1 (t = 1,..., K), where (t)+ stands for the positive part of a value t [12].

Fixed-knot splines are also called regression splines. It is necessary to select the order of the spline, the number of knots and their placement. One simple approach is to parameterize a family of splines by the number of basis elements or degrees of freedom and let the observations x^ determine the positions of the knots. We shall follow the latter approach

in Section 2; there, we shall define a special index for the selection of the spline degrees and, herewith, their orders.

Since the space of spline functions of a particular order and knot sequence is a vector space, there are many equivalent bases for representing them. Truncated power bases, being so conceptually simple, are not attractive numerically, because they can allow big rounding problems. The B-spline basis allows for efficient computations even when the number of knots K is large. For basic information about higher and 1-dimensional splines, we refer to [6].

1.4. Additive Models 1.4.1. Classical Additive Models

We stated that regression models, especially, linear ones, are very important in many applied areas. However, the traditional linear models often fail in real life, since many effects are generally nonlinear. Therefore, flexible statistical methods have to be used to characterize nonlinear regression effects; among these methods is non-parametric regression (Fox 2002 [8]). But, if the number of independent variables is large in the models, many forms of nonparametric regression do not perform well; in those cases, the number of them must be diminished. This decreasing causes the variances of the estimates to be unacceptably large. It is also difficult to interprete nonparametric regression depending on smoothing spline estimates. To overcome these difficulties, Stone (1985) [21] proposed additive models. These models estimate an additive approximation of the multivariate regression function. Here, the estimation of the individual terms explains how the dependent variable changes with the corresponding independent variables. We refer to Hastie and Tibshirani (1986) [10] for basic elements of the theory of additive models.

If we have data consisting of N realizations of random variable Y at m design values, enumerated by the index i, then the additive model takes the from

m

E ( Y| Xii,...,Xim) = 00 + fj (Xij). (1.11)

j=1

The functions fj are unknown arbitrary and smooth functions and they are mostly considered to be splines, i.e., piecewise polynomial, since, e.g., polynomials themselves have a too strong or early asymptotics to and, by this, they may not be satisfying for data fitting. As we shall explain, these functions are estimated by a smoothing on single ("separated") coordinates or clusters. A standard convention is to assume at Xj : E (fj (xj)) = 0 since, otherwise, there will be a free constant in each of the functions [13]. Here, Xj is the jth coordinate of the ith input data vector; later on, in the backfitting algorithm, these values also serve as the knots of the interpolating (or smoothing) splines which appear there. The estimation of the fj is done by an algorithm which performs a stepwise smoothing with respect to suitably chosen spline classes, to the points Xj, and to the differences r^ between an average y of the observed output data y^ and the sum of our functions evaluated at the interpolation knots Xj.

In our paper, we will introduce a new interpretation: We consider Xj consisting of m variates (coordinates of X) as a value of the j th one of m disjoint clusters of input data which we achieve; the cluster elements are enumerated by Xj. That is, there is the understanding of ith data in the j th component of the input variable (classical separation of variable

approach), and there is the new understanding of the ith points of the jth cluster (Ij) of input data. If these clusters have the same cardinality, which we shall assume without loss of generality, we denote it by N. With this new interpretation we will find and refer to structures, e. g., regularly in time repeating input constellations. We will "separate" them by clusters which we call intervals (maybe in higher-dimensional sense). Here, "regular" means that we can see or without of generality assume correspondences between the sample times in these time intervals, e. g., there are mondays, tuesdays, ... We recall that the output values corresponding to the inputs are denoted by yij. Averaging over these values with respect

m

to j, delivering yi := YI yij (i = 1,N), will then represent, e. g., an observation mean over j=i

the mondays, tuesdays, etc., respectively. Since our explanations hold for the understanding in the sense of "separation of variables" and the one of "(separated) clusters" as well, we may keep both of them in mind and refer to these two ones in the following.

Additive models have a strong motivation as a useful data analytic tool. Each variable is represented separately in (1.11) and the model has an important interpretation feature of some "linear model": Each of the variables separately effects the response surface and that effect does not depend on the other variables. Each function fj is estimated by an algorithm proposed by Friedman and Stuetzle (1981) [9] and called backfitting algorithm. As the estimator for 00, the arithmetic mean (average) of the output data is used: ave(yi | i =

N

1,..., N) := (1/N) yi. This procedure depends on the partial residual against x.

j-

i=1

rij = yi - 00 - ^ f k (Xik) ,

(1.12)

k=j

and consists of estimating each smooth function by holding all the other ones fixed. In a framework of cycling from one to the next iteration, this means the following [11]: initialization: 00 = ave(yi | i = 1,..., N), fj(xj) = 0 Vi, j; cycle j = 1, 2,...,m, 1, 2,...,m, 1, 2,...,m,...,

rij = yi - 00 - f k (Xik), i = 1, k=j

fk is updated by smoothing the partials residuals,

N,

rij = yi - 00 - Y, fk (Xik), i k=j

1,...,N, against xij until the functions almost do not

change.

The backfitting procedure is also called Gauss — Seidel algorithm. To prove its convergence, Buja and Hastie [5] reduced the problem to the solution of a corresponding homogeneous system, analyzed by a linear fixed point equation of the form Tf = f. In fact, to represent the effect on the homogeneous equations of updating the j th component under Gauss — Seidel algorithm, the authors introduced the linear transformation

fi

Tj : IR

Nm

IR

Nm

Sj

fm

- fk k=j

(1.13)

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

m

—>

A full cycle of this algorithm is determined by T = TmTm-1...T1; then, T1 correspond l full cycles. Here, Sj is a smoothing spline operator which performs the interpolation at the knots Xj. In the case where all smoothing Sj are symmetric and have eigenvalues in [0, 1], then the backfitting algorithm always converges. In Subsection 2.5, we will come back closer to the algorithm and the denotation used here.

1.4.2. Additive Models Revisited

In our paper, we allow a different and new motivation: In addition to the approach given by a separation of the variables Xj done by the functions fj, now we perform a clustering of the input data of the variable x by a partitioning of the domain into higher dimensional Qj or, in the 1-dimensional case: intervals , and a determination of fj with reference to the knots lying in Qj (or Ij), respectively. In any such a case, a cube or interval is taking the place of a dimension or coordinate axis. We will mostly refer to the case of one dimension; the higher dimensional case can then be treated by a combination of separation and clustering. That clustering can incorporate any kind of periods of seasons assumed, any comparability or correspondence of successive time intervals, etc. Herewith, the functions fj are more considered as allocated to sets Ij (or Qj) rather than depending on some special, sometimes arbitrary elements of those sets (input data) or output values associated. This new interpretation and usage of additive models (or generalized ones, introduced next) is a key step of this paper.

1.5. A Note on Generalized Additive Models

1.5.1. Introduction

To extend the additive model to a wide range of distribution families, Hastie and Tibshirani (1990) [13] proposed generalized additive models (GAM) which are among the most practically used modern statistical techniques. Many often used statistical models belong to this general class, e. g., additive models for Gaussian data, nonparametric logistic models for binary data, and nonparametric log-linear models for Poisson data.

1.5.2. Definition of a Generalized Additive Model

If we have m covariates (or values of clusters) X1 , X2, ...,Xm, comprised by the m-tuple X = (X1 ,...,Xm)T, and a response Y to the input X assumed to the have exponential family density (y,a,w) with the mean ^ = E (Y | X1,...,Xm) linked to the predictors through a link function G. Here, a is called the natural parameter and w is the dispersion parameter. Then, in our regression setting, a generalized additive model takes the form

m

G(MX)) = ^ (X) = + fj (Xj). (1.14)

j=1

Here, the function fj are unspecified ("nonparametric") and 9 = (0o, f1,..., fm)T is the unknown parameter to be estimated; G is the link function. The incorporation 0o as some average outcome allows us to assume E (fj (Xj)) = 0 (j = 1,...,m). Often, the unknown functions fj are elements of a finite dimensional space consisting, e. g., of splines and these functions depending on the cluster knots are mostly assumed to be splines; the spline orders

(or degrees) are suitably choosen depending on the density and variation properties of the corresponding data in x and y components, respectively. Then, our problem of specifying 9 becomes a finite-dimensional parameter estimation problem. In future research, we will extend the new methods of this paper to generalized additive models.

2. Investigation of the Additive Model

In this section, we analytically and numerically investigate the additive model. Before we introduce and study our modified backfitting algorithm, we approach the input and output data and address the aspect of stability.

2.1. Clustering of Input Data

2.1.1. Introduction

Clustering is the process of organizing objects into groups I2,Im or, higher dimensionally: Qi, Q2, •••, Qm, whose elements are similar in some way. A cluster is therefore a collection of objects which are "similar" between them and are "dissimilar" to the objects belonging to other clusters. We put two or more objects belonging to the same cluster if they are "close" according to a given distance (in this case, geometrical distance) [2, 14]. In this paper, differently from the classical understanding, we always interpret clustering as being accompanied by a partitioning of the (input) space, including space coverage. In other words, it will mean a classification in the absense of different labels or categories. The aim of clustering is to determine the intrinsic grouping in a set of unlabeled data. Therefore, we decide about clustering methods which depend on a criterion. This criterion must be supplied by the user, in such a way that the result of the clustering will suit his needs [16]. Clustering algorithms can be applied in many fields like marketing, biology, libraries, book ordering, insurance, city-planning or earthquake studies. For further information we refer to [3].

For each clusters, namely, Ij (or Qj), we denote the elements by Xj. This interpretation is new, since according to the classical understanding of additive models, there is a separation of variables made by the model functions added and Xj is just the jth coordinate of the ith input data vector.

2.1.2. Clustering for Additive Models

Financial markets have different kinds of trading activities. These activities work with considerably long horizons, ranging from days and weeks to months and years. For this reason, we may have any kind of data. These data can sometimes be problematic for being used at the models, for example, given a longer horizon with sometimes less frequent data recorded, but to other times highly frequent measurements. In those cases, by the differences in data density and, possibly, data variation, the underlying reality and the following model will be too unstable or inhomogeneous. The process may be depending on unpredictable market behaviour or external events like naturally calamity. Sometimes, the structure of data is has particular properties. These may be a larger variability or a handful of outliers. Sometimes we do not have any meaningful data. For instance, share price changes will not be available when stock markets are closed at weekends or holidays.

The models used need to be able to cope with such values, inventing values to fill such gaps is not a good way to proceed. Government and private research groups moved from sampling and analyzing financial data annually to monthly, to weekly, to daily, and, now, intradaily. One could choose to aggregate the financial data to fixed intervals of time, justifying this action by the large number of recording errors and the transient reactions to news that occur during intense trading periods. Naturally, some information would be lost due to aggregation. Too large or small an aggregating interval, or the changing of the interval when it should remain constant, could cause information to be lost or distorted. Methods which address the irregularity of ultra-high frequency financial data are needed.

The following three parts of Fig. 1 are showing some important cases of input data distribution and clustering: the equidistant case (cf. (a)) where all points can be put into one cluster (or interval) /1, the equidistant case with regular breaks (weekends, holidays, etc.; cf. (b) where the regularly neighbouring points and the free days could be put in separate cluster intervals /j, and the general case (cf. (c)) where there are many interval /j of different

Fig. 1. Three important cases of input data distribution and its clustering: equidistance (a), equidistance with breaks (b), and general case (c).

a

b

c

Fig. 2. Example of a data (scatterplot); here, we refer to case (c) of Fig. 1.

interval lengths and densities. We remark that we could also include properties of the output data y into this clustering; for the ease of exposition, however, we disregard this aspect.

In the following, we will take into account the data variation; to get and impression of this, please have a look at Fig. 2.

For the sake of simplicity, we assume from now on that the number Nj of input data points Xj in each cluster Ij is the same, say, Nj = N (j = 1, ...,m). Otherwise there will be no approximation need at data points missing and the residuals of our approximation were 0 there. Furthermore, given the output data yj we denote the aggregated value over the all the ith output values of the clusters by

m

y := yij (i = 1,-,N)-j=i

In the example of case (1, b), this data summation refers to all the days i from monday to friday. Herewith, the cluster can also have a chronolocial meaning. By definition, up to the division by m, the values y^ are averages of the output values yj.

Before we come to a closer understanding of data density and variation, we proceed with our introduction of splines. In fact, the selection of the splines orders, degrees and classes will essentially be influenced by indices based on densities and variations (Subsection 2.3).

2.2. Interpolating Splines

Let x1j,x2j, •••,xNj be N distinct knots of [a, b], where a < x1j < x2j < ... < xNj < b. The function fk(x) on the interval [a, b] (or in R) is a spline of some degree k relative to the knotsxj if [18]

(1) fk|[xj, xi+1 j] £ (polynomial of degree < k; i =1,..., N — 1),

(2) fk £ Ck-1 [a, b].

Here, Ck-1 [a, b] is the set of functions defined on the interval [a, b] which can be extended on an open neighbourhood of the interval such that it is (k—1)-times continuously differentiable at each element of the neighbourhood. To characterize a spline of degree k, fk;i := fk |[xj, xi+1 j ] can be represented by

k

fk,i(x) = ^¿¿(x — xij)l (x £ [xij,xi+1j]). l=0

There are (k + 1)(N — 1) coefficients gu to be determined. Furthermore, it has to hold fS-1,(xij) = fS(xij) (i = 1,..., N — 2; l = 0,..., k — 1). Then, there are k(N — 2) conditions, and the remaining degrees of freedom are (k + 1)(N — 1) — k(N — 2) = k + N — 1 [18].

2.3. Variation and Density

Density is a measure of mass per unit of volume. The higher an object's density, the higher its mass per volume. Let us assume that we have /i,..., intervals; then, the density of the input data Xj in the jth interval /j is defined by

number of point Xj in /j

length of /

This definitions can be directly generalized to the higher dimensional interval rather than intervals Ij, by referring to the higher dimensional volumes.

Variation is a quantifiable difference between individual measurements. Every repeatable process exhibits variation. If over the interval Ij we have the data (xj, yy),..., (xNj, yNj), then the variation of these data refers to the output dimension y and it is defined as

N -1

V:= Y1 lyi+ij- yij |. i=1

Since we do the spline interpolation in the course of the algorithm with respect to

the residuals rj, we can also in every iteration separately refer to a variation, defined by N-1

Vj := |ri+1j — rj|; for simplicity, we suppressed the iteration index. The change of the i=1

reference outputs could be made after some iterations. We point out that this policy and the determination of the spline degrees discussed in Subsection 2.4 are left to the practitioner who is looking at the given data and follows the algorithm or, if a closed model is preferred rather than adaptive elements, they can be decided based on thresholds.

If the variation is big, at many data points the rate of change of the angle between any approximating curve and its tangent would be big, i.e., its curvature could be big. Otherwise, the curvature could be expected to be small. In this sense, high curvature over an interval can mean a highly oscillating behaviour. The occurrence of outliers yj (of rj) may contribute to this very much and mean instability of the model.

2.4. Index of Data Variation

Still we assume that /1,...,/m (or Q1,...,Qm) are the intervals (or cubes) according to the data grouped. For each interval Ij (cube Qj), we define the associated index of data variation by

Indj := Dj Vj

or, more generally,

Indj := dj(Dj)vj(Vj),

where dj, Vj are some positive, strongly monotonically increasing functions selected by the modeller. In fact, from both the viewpoints of data fitting and complexity (or stability), cases with a high variation distributed over a very long intervall are very much less problematic than cases with a high variation over a short intervall. The multiplication of variation terms with density terms due to each interval found by clustering is representing this difference.

We determine the degree of the splines fj with the help of the numbers Indj. If such an index is low, then we can choose the spline degree (or order) to be small. In this case, the spline may have a few coefficients to be determined and we can find these coefficients easily using any appropriate solution method for the corresponding spline equations. If the number Indj is big, then we must choose a high degree of the spline. In this case, the spline may have a more complex structure and many coefficients have to be determined; i.e., we may have many system equations or a high dimensional vector of unknows; to solve this could become very difficult. Also, a high degree of splines f1, f2,..., fm, respectively, causes high curvatures or oscillations, i.e., there is a high "energy" implied; this means a higher (co)variance or instability under data perturbations. As the extremal case of high curvature

we consider nonsmoothness meaning an instantaneous movement at a point which does not obey to any tangent.

The previous words introduced a model-free element into our explanations. Indeed, as indicated in Subsection 2.3, the concrete determining of the spline degree can be done adaptively by the implementer who writes the code. From a close mathematical perspective we propose to introduce discrete thresholds yv and to assign to all the intervals of indices Ind £ [yv,Yv+i) the same specific spline degrees. This determination and allocation has to base on the above reflections and data (or residuals) given.

For the above reasons, we want to impose some control on the oscillation. To make the oscillation smaller, the curvature of each spline must be bounded by the penalty parameter. We introduce a penalty parameter into the criterion of minimizing RSS, called penalized sum or squares PRRS now:

N ( m ^ 2 m b

PRSS(0o, /1,..., fm) := £ y - 0o - £ /(Xij) + £ n /'(j)]2dtj. (2.1)

¿=1 i /=1 j /=1 a

The first term measures the goodness of data fitting, while the second term is a penalty term and defined by means of the functions' curvatures. Here, the interval [a, b] is the union of all the intervals Ij. In the case of separation of variables, the interval bounds may also depend on j, i.e., they are [a/, bj]. For the sake of simplicity, we sometimes just write "J" and refer to the interval limits given by the context. There are also further refined curvature measures, especially, one with the input knot distribution implied by Gaussian bell-shaped density functions; these appear as additional factors in the integrals and have a cutting-off effect. For the sake of simplicity, we shall focus on the given standard one now and turn to the sophisticated model in a later study.

We also note that the definition of PRSS in (2.1) can be extended to the higher dimensional case by using the corresponding higher dimensional integrals. However, one basic idea of the (generalized) additive models just consists in the separation of the variables.

In (2.1), > 0 are tuning or smoothing parameters and they represent a tradeoff between first and second term. Large values of yield smoother curves, smaller values result in more fluctuation. It can be shown that the minimizer of PRRS is an additive spline model: Each of the functions /j is a spline in the component Xj, with knots at Xj (i =1,...,N). However, without further restrictions on the model, the solution is not unique. The constant 0O is not identifiable since we can add or substract any constants to each of the functions /j, and adjust 0O accordingly. For example, one standard convention is to assume

m

that /j (x/) = 0 Vi, the function average being zero over the corresponding data (e. g., of j=1

mondays, tuesdays, ..., fridays, respectively). In this case, 0O = ave(yi|i=1,...,N), as can be seen easily.

We firstly want to have

Nm

jNy* - 0o /j (xij ^ ~ 0 ¿=1 I j=i

and, secondly,

m

V^ I r ,n,. M2

/'(j)] 2dtj

j j j

j=1

0

or being sufficiently small, at least bounded. In the backfitting algorithm, these approximations, considered as equations, will give rise to expected or update formulas. For these requests, let us introduce

N ( m \ 2

F(0° ,/) := £ |Vi - 0° - j=J / (Xij ) J and g (/) := J [/j(tj)] - M,,

where / = (/,...,/m)T. The terms gj (/) can be interpreted as curvature integral values minus some prescribed upper bounds Mj > 0. Now, the combined standard form of our regression problem subject to the constrained curvature condition

Minimize F(0,,/) (2 2)

subject to gj(/) < 0 (j = 1,...,m).

Now, PRSS can be interpreted in Lagrangian form as follows:

N f m ^ 2 m ✓ „ x

L ((00, /), p) = £ Vi - Ao - £ /(Xij) + £ pJ [//(j^ 2dj - Mj , (2 . 3)

i=l I j=1 J j=1 '

where p := (p1,...,pm)T. Here, pj > 0 are auxilary penalty parameters introduced in [4]. In the light of our optimization problem, they can now be seen as Lagrange multipliers associated with the constraints gj < 0. The Lagrangian dual problem takes the form

mxmin l ((0,/ ),p). (2.4)

(,00,/)

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

The solution of this optimization problem (2.4) will help us for determining the smoothing parameters pj and, in particular, the functions / will be found, likewise their bounded curvatures / [/j'(j)] 2dtj. Herewith, a lot of future research is initialized which can become an alternative to the backfitting algorithm concept. In this paper, we go on with refining and discussing the backfitting concept for the additive model.

2.5. Modified Backfitting Algorithm for Additive Model 2.5.1. Additive Model Revisited

For the additive model (cf. Subsection 1.4), we will modify the backfitting algorithm used before for fitting additive model. For this reason, we will use the following theoretical setting in term of conditional expectation (Buja, Hastie and Tibshirani (1989) [5]), where j = 1, 2,...,m:

/j (X) = P [Y - 0° - £ /k(Xfc) J := E ( Y - 0° - £ /(Xfc)| X ) . (2.5)

V k=j / V k=j /

Now, to find more robust estimation for /(Xj) in our additive model, let us add the term

m

- Pk f [/k(tk)] dtk to equation (2.5). In this case, (2.5) will become the update formula

k=l

/j(Xj) - P (y -0° - £/k(Xfcn - (£Pk / V k=j / \k=i ^

/ (tk )]2dtk

k=j

or

E( Y - 0o -J] fk (Xfc )| Xj k=j

/ [fk(tk)]2dtk

(2.6)

fj(Xj) + <pk J [fk(tk)]2dtj ^ Pj ( Y - 0o -J] fk(XkH - X) ^ / [fk(tk)]2dtk

k=j / \k=j ^

= E Y - 00 -J] /(Xk)| XJ - {J2 Vk / [/k (tk)]2dtk k=j k=j m r T2 ^

where Vk / /k'(tk) dtk =: Cj (constant, i. e., not depending on the knots); the functions / k=j

are unknown and to be determined in the considered iteration. By using the notation Cj, we underline that the weighted sum of integrals which Cj denotes is just a scalar value and, herewith, introducing a constant shift in the following. Therefore, we can write equation (2.6) as

fj (Xj ) + <pkj [fk (tj )]2 dtj ^ E\Y - 0o - k=j fk (Xk ) + <pk J [fk (tk )]2dtk

If we denote Zk(Xk) := fk(Xk) + / [fk(tk)]2dtk (the same for j), then we get the

Xj

update formula

Zj(Xj) ^ E ( Y - 0o - £ Zk (Xk)|Xj

k=j

(2.7)

For random variables (Y, X), the conditional expectation /(x) = E (Y |X = x) minimizes E (Y - / (X))2 over all L2 functions / [5]. If this idea is applied to additive model n(X), then the minimizer of E (Y - n(X))2 will give the closest additive approximation to E (Y |X). Equivalently, the following system of normal equations is necessary and sufficient for Z = (Z1,...,Zm)T to minimize E (Y - n(X))2 (for the formula without intercept 0O, we refer to [5]):

( I Pi

P2 I

Pi P2

( Zi (Xi) \

Z2 (X2)

/Pi (Y - 0oe) \ P2 (Y - 0oe)

(2.8)

\ Pm Pm . . I / V Zm (Xm^ \ Pm (Y - 0oe) / where e is the N-vector or entries 1; or, in short,

PZ = Q (Y - .

Here, P and Q represent the matrix and vector of operators, respectively. If we want to apply normal equation to any given discrete experimental data, we must change the variables (Y, X) in the (2.8) by their realizations (y^ Xj), x^ = (xi1,..., xim), and the conditional

m

m

m

expectations Pj = E (•I Xj) by smoothers Sj on Xj

( I

S2

Si I

Sm Sm

51

52

/ zi \ Z2

( Si(y - /3°e) \ S2(y - 0°e)

(2.9)

1 / V zm / V Sm(y - 00°e) 7

For (2.9) we briefly write (in our estimation notation used from now on):

Pz = Q (y - /3°) =: Qyi.

Here, referring to the base spline representation (cf. (1.9)) over Ij, Sj = (hjl(xi))i=i)...)N

z=i,'...,'n

are smoothing matrices of type N x N (i is the row index and l the column index), Zj are

r ^ 12

N-vectors representing the spline function / + pj J /'j(tj) dtj in a canonical form (1.12),

N

i. e., jhj(X) (with the number of unknown equal to the number of conditions). In this i=i

notation, without loss of generality we already changed from lower spline degrees dj to a maximal one d, and to the order N.

Furthermore, (2.9) is an (Nm x Nm)-system of normal equations. The solutions to (2.9) satisfy Zj 6 ^(Sj), where K(Sj) is the range of the linear mapping Sj, since we update by

Zj — Sj y - 0°e - Y, Zk .

k=j

In case, we want to emphasize among the unknowns, i.e., (, zi

T

then

equation (2.9) can equivalently be represented in the following quadratic form:

O O O .O /° OV

I I Si . Si zi SiV

I S2 I S2 . S2 z2 S2V

I

I Sm

I zm

(2.10)

\ SmV J

where O is the N x N matrix with zero entries. There is a variety of efficient methods for solving the system (2.9), which depend on both the number and types of smoother used. If the smoother matrix Sj is a N x N nonsingular matrix, then the matrix P will be a nonsingular (Nm x Nm)-matrix; in this case, the system Pz = Qyi has a unique solution. If the smoother matrices Sj are not guaranteed to be invertible (nonsingular) symmetric, but just arbitrary (N x N)-matrices, we can use a generalized inverses S- (i.e., SjS-Sj = Sj) and P-. For closer information about generalized solution and matrix calculus we refer to [17].

2.5.2. Modified Backfitting Algorithm

Gauss — Seidel method, applied to blocks consisting of vectorial component zi, z2,..., zm, exploits the special structure of (2.9). It coincides with the backfitting algorithm. If in the

z

m

algorithm we write Zj = /j + ^j J /'j (tj) dtj (in fact, the functions /j are unknowns), then, the pth iteration in the backfitting or Gauss — Seidel includes the additional penalized curvature term. Not forgetting the step-wise update of the penalty parameter^- but not mentioning it explicitely, then the framework of the procedure looks as follows:

1

N

N

1) initialize 0O = ^ / = 0 ^ Zj = 0, Vj;

¿=1

2) cycle j = 1, 2,...,m, 1, 2,...,m, 1, 2,...,m,...

Zj ^ Sj

yi - 00 - Zk (xifc )

N

¿=1

This iteration is done until the individual functions do not change: Here, in each iterate,

Zj is by the spline with reference to the knots Xj found by the values y^ — 0O — (xik)

fc=j

(i = 1,..., N), i. e., by the other functions zk and, finally, by the functions / and the penalty

I" " 1 2

(smoothing) parameter . Actually, since by definition it holds Zj = / + ^j / /'j (tj) dtj, throughout the algorithm we must have a book keeping about both / and the curvature

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

r " 1 2

effect J / 'j- (tj) dtj controlled by the penalty parameter ^j which we can update from

step to step. This book keeping is guaranteed since /j and the curvature J / 'j (tj)

can be determined via Zj. Since the value of ^j / / 'j (tj) derivative of Zj is

dtj is constant, the second order

this yields

and, herewith,

j (tj ) = /'j (tj );

/'j(tj) dtj := [j(tj)] 2dtj

fj := % - ^ / /'j (tj)

dtj.

2

2.5.3. Discussion about Modified Backfitting Algorithm

If we consider our optimization problem on (2.1) (cf. also (2.4)) as fixed with respect to ^j, then we can carry over the convergence theory about backfitting (see Section 1.3) to the present modified backfitting, replacing the functions /j by Zj.

However, at least approximatively, we have to guarantee feasibility also, i.e.,

/ 'j(tj)

dj < Mj

(j = 1,...,m).

If J" / 'j (tj) dtj < Mj, then we preserve the value of ^j for l ^ l + 1; otherwise, we increase . But this update changes the values of Zj and, herewith, the convergence behaviour of the algorithm. What is more, the modified backfitting algorithm bases on both terms in the

2

objective function to be approximated by 0; too large an increase of pj can shift too far away from 0 the corresponding penalized curvature value in the second term.

The iteration stops if the functions fj become stationary, i.e., not changing very much

N m

and, if we request it, if ^ < y — ft0 — fj(xj)

z=i [ j=i

under some error threshold e, and, in particular,

becomes sufficiently small, i.e., lying

1 2

f'j (tj ) dtj < Mj j = 1,..., m).

2

2.5.4. Alternative Approaches

If we have I1, ...,Im interval and we chosen penalty terms with the distribution of the knots taken into account by bell-shaped density functions multiplied at the squared second-order

derivatives

f''j (tj) , then our problem PRSS takes the form

N

PRSS (po, fi,...,fm)=^ \vi - Po - £ fj (xij ) } +

j=i

i=1

exp

j=1

(xij Xj)

<31.

f j (tj )

dtj

(cf. (2.1) and (2.3)), where i and j have the same meaning as before, namely, for enumerating the data and spline functions, respectively. Herewith, there would be a "cut-off effect" outside of the intervals Ij and the modified penalty terms even more forced to be closer to zero (Fig. 3). Here Xj and aj are respectively average value and variance for Xj knots which is in the Ij interval and they are defined respectively,

Xj = N

Nj

irE

X-

ij

j

2

<j = NFÏ

Nj (

X-

ij

— - )2 xj )

2

m

2

Besides of this and further improvements of the additive model and the corresponding modified backfitting algorithm being possible, the previous discussions teach us that the developed methods of continuous optimization theory will become an important complementary

Fig. 3. Cutting-off in data approximation.

concept and alternative to the concept of backfitting algorithm. This will be subject of future research.

2.6. On a Numerical Example

Numerical applications arise in many areas of science, technology, social life and economy with, in general, very huge and firstly unstructured data sets; in particular, they may base on data from financial mathematics. These data can be got, e.g., from Bank of Canada (http://www.bankofcanada.ca/en/rates/interest-look.html) as daily, weekly and monthly; they can be regularly partioned, which leads to a partitioning (clustering) of the (input) space, and indices of data variation can be assigned accordingly. Then, we decide about the degrees of the spline depending of the location of the indices between thresholds yv. In this entire process, the practitioner has to study the structure of the data. In particular, the choice on the cluster approach at all, or of the approach on separation of variables, or of a combination of both, has to be made at an early stage and in close collaboration between the financial analyst, the optimizer and the computer engineer. At Institute of Applied Mathematics of METU, we are in exchange with the experts of its Department of Financial Mathematics, and this application is initiated. Using the splines which we determine by the modified backfitting algorithm, an approximation for the unknown functions of the additive model can be iteratively found. There is one adaptive element remaining in this iterative process: the update the penalty parameter, in connection with the observation of the convergence behaviour. Here, we propose the use and implementation of our algorithm as well as an interior point algorithm related with a closed approach to our problems and real-world applications by conic quadratic programming. In the following last section, will will sketch this second approach. A comparison and possible combination of these two algorithmic strategies is what we recommend in this pioneering paper.

3. Concluding Remarks and Outlook

This paper has given a contribution to the discrete approximation or regression of data in 1- and multivariate cases. The additive models have been investigated, input data grouped by clustering, its density measured, data variation quantified, spline classes selected by indices and thresholds, and their curvatures bounded with the help of penalization. Then, the backfitting algorithm which is also applicable for data classification has become modified. This modified backfitting algorithm which we present in our pioneering paper gives a new tool in the arsenal of approximating the unknown functions fj while governing the instability caused. What is more, our paper offers to the colleagues from the practical areas of real-world examples a number of refined versions and improvements. But, this algorithm has some disadvantage in the iterative update of the penalty parameter. Indeed, this update changes the convergence behaviour of the algorithm. For this reason, a further utilization of modern optimization has been recommended [20], diminishing the adaptive and modelfree elements of the algorithm. For this, if we turn to optimization problem (2.2), we can

equivalently write this optimization problem in the following form:

min t,

t,Po,f

N ( m \ 2

where £ lyt - 0o - £ fj (xj ) I < t2, t > 0,

i=i k j=i 1 2

J [f'itj^ dtj < M3 j = 1, 2, ...,m).

As mentioned previously, the functions fj are elements in a corresponding spline spaces. Instead of the integrals J [fj /(tj )] dtj we may use the approximative discretized form of Riemann sums, e.g., by evaluating the base splines fj'(-) at the knots xij. Then, we can write our optimization problem equivalently as

min t,

t,/3o,e

where ||W(0o,9)f2 < t2,

\\Vj (P0 Ml < Mj (j = 1, 2,...,m), 0 < t,

where Ui := ^ xi+ij - Xj (i = 1, 2, ...,N - 1), VjT = (fj'(xij )ui,..., (xN -j )uNand

W (0o,^):=(yi - 0o ^ fj (xj), ... , yN - 0o ^ fj (xNj ))T.

j=i j=i

Herewith, our optimization program has turned to a conic quadratic programming problem [14] which can be solved by interior point algorithms [15, 19]. In future, this approach and the main one presented in our paper will be further compared and combined by us, and applied on real-word data. By all of this we give a contribution to a better understanding of data from the financial world and many other practical areas, and to a more refined instrument of prediction.

The authors express their gratitude to Prof. Dr. Biilent Karsasozen (IAM, METU) for hospitality given to Pakize Taylan in DOSAP program, to him and to Prof. Dr. Hayri Kore-zlioglu (IAM, METU), Prof. Dr. Klaus Schittkowski (University of Bayreuth, Germany), Prof. Dr. Peter Spellucci (Darmstadt University of Technology, Germany), Dr. Omur Ugur and Dr. Qoskun Kucukozmen (both from IAM, METU) for valuable discussions, and to the unknown referee for his valuable criticisim.

m

m

References

[1] Aküme D., Weber G.-W. Cluster algorithms: theory and methods // J. of Comp. Technol. 2002. Vol. 7, N 1. P. 15-27.

[2] Aster A., Borchers B., Thürber C. Parameter Estimation and Inverse Problems. N.Y.: Acad. Press, 2004.

[3] Bock H.H., Sokoowski A., Jajuga K. Classification, Clustering, and Data Analysis: Recent Advances and Applications. B.: Springer-Verl., 2002.

[4] Boyd S., Vandenberghe L. Convex Optimization. L.: Cambridge Univ. Press, 2004.

[5] Buja A., Hastie T., Tibshirani R. Linear smoothers and additive models // The Ann. Stat. 1989. Vol. 17, N 2. P. 453-510.

[6] De Boor C. Practical Guide to Splines. B.: Springer-Verl., 2001.

[7] Easton V.J., McColl J.H. http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

[8] Fox J. Nonparametric Regression, Appendix to an R and S-Plus Companion to Applied Regression. Sage Publ., 2002.

[9] Friedman J.H., Stuetzle W. Projection pursuit regression // J. Amer. Statist. Assoc. 1981. Vol. 76. P. 817-823.

[10] Hastie T., Tibshirani R. Generalized additive models // Statist. Sci. 1986. Vol. 1, N 3. P. 297-310.

[11] Hastie T., Tibshirani R. Generalized additive models: some applications //J. Amer. Statist. Assoc. 1987. Vol. 82, N 398.

[12] Hastie T., Tibshirani R., Friedman J.H. The Element of Statistical Learning. N.Y.: Springer-Verl., 2001.

[13] Hastie T.J., Tibshirani R.J. Generalized Additive Models. N.Y.: Chapman and Hall, 1990.

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

[14] Nemirovskii A. Modern Convex Optimization: Lecture Notes. Israel Institute of Technology, 2005.

[15] Nesterov Y.E., Nemirovskii A.S. Interior Point Methods in Convex Programming. SIAM, 1993.

[16] Politicnico di Milano, Dipartimento de Elettronica e Informazione, A Tutorial on Clustering Algorithms. http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/

[17] Pringle R.M., Rayner A.A. Generalized Inverse Matrices With Applications to Statistics. Hafner Publ., 1971.

[18] Quarteroni A., Sacco R., Saleri F. Numerical Mathematics. Texts in Applied Mathematics 37. Springer, 1991.

[19] Renegar J. Mathematical View of Interior Point Methods in Convex Programming. SIAM, 2000.

[20] Spellucci P. Personal Communication. Germany: Darmstadt Univ. of Technology, 2006.

[21] Stone C.J. Additive regression and other nonparametric models // The Ann. Stat. 1985. Vol. 13, N 2. P. 689-705.

[22] Waggoner D.F. Spline methods for extractions interest rate curves coupon bound prices. Federal Reserve Bank of Atlanta Working Paper 97-10. 1997.

Received for publication 17 July 2006

i Надоели баннеры? Вы всегда можете отключить рекламу.