
PRINCIPAL CURVES

by

Trevor Hastie Werner Stuetzle

TECHNICAL REPORT No. 151 Revised October 1988

Department of Statistics, GN-22 University of Washington Seattle, Washington 98195 USA


AUTHORS' FOOTNOTE Trevor Hastie is a Member of the Technical Staff at AT&T Bell Laboratories, Murray Hill, New Jersey 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, 98195. This work was developed for the most part at Stanford University with partial support from the Department of Energy under contracts DE-AC03-76SF and DE-AT03-81-ER10843, the Office of Naval Research contract ONR N00014-81-K-0340, and by the U.S. Army Research Office.

The authors thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support; Robert Tibshirani, Brad Efron and Jerry Friedman for many helpful discussions and suggestions; Horst Friedsam and Will Oren for supplying the SLC example and for their help with the analysis; and both referees for their constructive criticism of earlier drafts.

Abstract

Principal curves are smooth one dimensional curves that pass through the middle of a p dimensional data set, providing a non-linear summary of the data. They are non-parametric, and their shape is suggested by the data. The algorithm for constructing principal curves starts with some prior summary, such as the usual principal component line. The curve in each successive iteration is a smooth or local average of the p dimensional points, where the definition of local is based on the distance in arc length of the projections of the points onto the curve found in the previous iteration. In this paper we define principal curves, give an algorithm for their construction, present some theoretical results, and compare our procedure to other generalizations of principal components. Two applications illustrate the use of principal curves. In the first application, two different assays for gold content in a number of samples of computer chip waste appear to show some systematic differences which are blurred by measurement error. The classical approach using linear errors in variables regression can detect systematic linear differences, but is not able to account for nonlinearities. When we replace the first linear principal component by a principal curve, a local "bump" is revealed, and we use bootstrapping to verify its presence. As a second example we describe how the principal curve procedure was used to align the magnets of the Stanford Linear Collider. The collider uses around 950 magnets in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction it was found that some of the magnets had ended up significantly out of place. As a result the beams had to be bent too sharply and could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations, but rather to a sufficiently smooth arc through the "middle" of the existing positions. This arc was found using the principal curve procedure.

Keywords: Principal Components, Smoother, Non-parametric, Non-linear, Errors in Variables, Symmetric, Self-Consistency.

1. Introduction

Consider a data set consisting of $n$ observations on two variables, $x$ and $y$. We can represent the $n$ points in a scatterplot, as in figure 1a. It is natural to try and summarize the pattern exhibited by the points in the scatterplot. The form of summary we choose depends on the goal of our analysis. A trivial summary is the mean vector, which simply locates the center of the cloud but conveys no information about the joint behavior of the two variables.

It is often sensible to treat one of the variables as a response variable and the other as an explanatory variable. The aim of the analysis is then to seek a rule for predicting the response using the value of the explanatory variable. Standard linear regression produces a linear prediction rule: the expectation of $y$ is modeled as a linear function of $x$ and is usually estimated by least squares. This procedure is equivalent to finding the line that minimizes the sum of vertical squared deviations, as depicted in figure 1a.

In many situations we do not have a preferred variable that we wish to label response, but would

still like to summarize the joint behavior of $x$ and $y$. The dashed line in figure 1a shows what happens if we used $x$ as the response: simply assigning the role of response to one of the variables can lead to a poor summary. An obvious alternative is to summarize the data by a straight line that treats the two variables symmetrically. The first principal component line in figure 1b does just this, by minimizing the orthogonal deviations.

Linear regression has been generalized to include nonlinear functions of $x$. This has been achieved using predefined parametric functions, and, with the reduced cost and increased speed of computing, nonparametric scatterplot smoothers have gained popularity. These include kernel smoothers (Watson 1964), nearest neighbor smoothers (Cleveland 1979), and spline smoothers (Silverman 1985). In general, scatterplot smoothers produce a curve that minimizes the vertical deviations, as depicted in figure 1c, subject to some form of smoothness constraint. The non-parametric versions referred to above allow the data to dictate the form of the non-linear dependency.

In this paper we consider similar generalizations for the symmetric situation. Instead of summarizing the data with a straight line, we use a smooth curve; in finding the curve we treat the two variables symmetrically. Such curves pass through the middle of the data, whether or not the middle is a straight line; this situation is depicted in figure 1d. These curves, like linear principal components, focus on the orthogonal or shortest distance to the points.

We formally define principal curves to be those smooth curves that are self consistent for a distribution or data set. This means that if we pick any point on the curve, collect all the data that "project" onto this point and average them, then this average coincides with the point on the curve. The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

The largest principal component line plays roles other than that of a data summary:

· In errors in variables regression it is assumed that there is randomness in the predictors as well as the response. This can occur in practice when the predictors are measurements of some underlying variables, and there is error in the measurements. It also occurs in observational studies where neither variable is fixed by design. The errors in variables regression technique models the expectation of $y$ as a linear function of the systematic component of $x$. In the case of a single predictor, the model is estimated by the principal component line. This is also the total least squares method of Golub and van Loan (1979). More details are given in an example in section 7.

· Often we want to replace a number of highly correlated variables by a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.


· In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; for a one factor model, one could use the largest principal component. Many variations of this model have appeared in the literature.

In all the above situations the model can be written as

$$x_i = u_0 + a\lambda_i + e_i, \qquad (1)$$

where the systematic component $u_0 + a\lambda_i$ is a straight line and $e_i$ is the random component. If we assume spherically symmetric errors, the least squares estimate is the first linear principal component. A natural generalization replaces the straight line by a smooth curve:

$$x_i = f(\lambda_i) + e_i. \qquad (2)$$

This might then be a factor analysis or structural model, and for two variables and some restrictions an errors in variables regression model. In the same spirit as above, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2).

This paper focuses on the definition of principal curves, and on an algorithm for finding them. We also present some theoretical results, although many open questions remain.

2. The principal curves of a probability distribution

We first give a brief introduction to one dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also discuss briefly an alternative approach via regularization using smoothing splines.

2.1. One dimensional curves

A one dimensional curve in $p$ dimensional space is a vector $f(\lambda)$ of $p$ functions of a single variable $\lambda$. These functions are called the coordinate functions, and $\lambda$ provides an ordering along the curve. If the coordinate functions are smooth, then $f$ is by definition a smooth curve. We can apply any monotone transformation to $\lambda$, and by modifying the coordinate functions appropriately the curve remains unchanged. The parametrization, however, is different. There is a natural parametrization for curves in terms of the arc-length.

The arc-length of a curve $f$ from $\lambda_0$ to $\lambda_1$ is given by
$$l = \int_{\lambda_0}^{\lambda_1} \|f'(z)\|\, dz.$$
If $\|f'(z)\| \equiv 1$ then $l = \lambda_1 - \lambda_0$. This is a rather desirable situation, since if all the coordinate variables are in the same units of measurement, then $\lambda$ is also in those units. The vector $f'(\lambda)$ is tangent to the curve at $\lambda$ and is sometimes called the velocity vector at $\lambda$.

A curve with $\|f'\| \equiv 1$ is called a unit speed curve. We can reparametrize any smooth curve to make it unit speed. For a unit speed curve, our requirement of smoothness relates more naturally to the shape of the curve: smoothness of the coordinate functions translates directly into smooth visual appearance of the point set $\{f(\lambda),\ \lambda \in \Lambda\}$ (absence of sharp bends).

If $u$ is a unit vector, then $f(\lambda) = u_0 + \lambda u$ is a unit speed straight line. This parametrization is not unique: $f^*(\lambda) = u_0 + au + \lambda u$ is another unit speed parametrization for the same line. In the following we will always assume that $\langle u_0, u \rangle = 0$.

The vector $f''(\lambda)$ is called the acceleration of the curve at $\lambda$; for a unit speed curve, it is easy to check that it is orthogonal to the tangent vector. In this case $f''/\|f''\|$ is called the principal normal to the curve at $\lambda$. The two vectors $f'(\lambda)$ and $f''(\lambda)$ span a plane. There is a unique unit speed circle in the plane that goes through $f(\lambda)$ and has the same velocity and acceleration at $f(\lambda)$ as the curve itself. The radius $r_f(\lambda)$ of this circle is called the radius of curvature of the curve $f$ at $\lambda$; it is easy to see that $r_f(\lambda) = 1/\|f''(\lambda)\|$. The center $c_f(\lambda)$ of the circle is called the center of curvature of $f$ at $\lambda$.

Thorpe (1979) gives a clear introduction to these and related ideas in differential geometry.

[Insert figure 3 around here]

2.2. Definition of principal curves

Denote by $X$ a random vector in $\mathbb{R}^p$ with density $h$ and finite second moments. Without loss of generality assume $E(X) = 0$. Let $f$ denote a smooth ($C^\infty$) unit speed curve in $\mathbb{R}^p$ parametrized over $\Lambda \subseteq \mathbb{R}^1$, a closed (possibly infinite) interval, which does not intersect itself ($\lambda_1 \neq \lambda_2 \Rightarrow f(\lambda_1) \neq f(\lambda_2)$) and has finite length inside any finite ball in $\mathbb{R}^p$. We define the projection index $\lambda_f: \mathbb{R}^p \to \mathbb{R}^1$ by
$$\lambda_f(x) = \sup\big\{\lambda:\ \|x - f(\lambda)\| = \inf_\mu \|x - f(\mu)\|\big\}. \qquad (3)$$
The projection index $\lambda_f(x)$ of $x$ is the value of $\lambda$ for which $f(\lambda)$ is closest to $x$. If there are several such values, we pick the largest one. We show in the appendix that $\lambda_f(\cdot)$ is well defined and measurable.

Definition

The curve $f$ is called self-consistent or a principal curve of $h$ if $E(X \mid \lambda_f(X) = \lambda) = f(\lambda)$ for a.e. $\lambda$.

Figure 3 illustrates the intuitive motivation behind our definition of a principal curve. For any particular parameter value $\lambda$ we collect all the observations that have $f(\lambda)$ as their closest point on the curve. If $f(\lambda)$ is the average of those observations, and if this holds for every $\lambda$, then $f$ is called a principal curve. In the figure we have actually averaged observations projecting into a neighborhood on the curve; this anticipates our algorithms to come, where we use averaging over neighborhoods to estimate conditional expectations.

The definition of principal curves immediately gives rise to a number of interesting questions: for what kinds of distributions do principal curves exist, how many principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous; there are densities which do have principal curves. It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two dimensional spherically symmetric distribution, a circle with center at the origin and radius $E\|X\|$ is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. See, however, our note at the beginning of the appendix.) We show in the appendix that for compact $\Lambda$ it is always possible to construct densities with carrier in a thin tube around $f$, which have $f$ as a principal curve.

What about data generated from the model $X = f(\lambda) + \epsilon$, with $f$ smooth and $E(\epsilon) = 0$? Is $f$ a principal curve for this distribution? The answer seems to be no in general. We will show in section 6, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of $x$, given $\lambda(x) = \lambda_0$, lies outside the circle of curvature at $\lambda_0$; this implies that $f$ cannot be a principal curve. So in this situation the principal curve will be biased for the functional model. We have some evidence that this bias is small, and that it decreases to zero as the variance of the errors gets small relative to the radius of curvature. We discuss this bias, as well as estimation bias (which fortunately appears to operate in the opposite direction), in section 6.

3. Connections between principal curves and principal components

In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1: If a straight line $l(\lambda) = u_0 + \lambda v_0$ is self consistent, then it is a principal component.

Proof: The line has to pass through the origin, because
$$0 = E(X) = E_\lambda E(X \mid \lambda_l(X) = \lambda) = E_\lambda(u_0 + \lambda v_0) = u_0 + E_\lambda(\lambda)\, v_0,$$
and we assumed $u_0 \perp v_0$, so $u_0 = 0$. It remains to show that $v_0$ is an eigenvector of the covariance of $X$:
$$\Sigma v_0 = E(XX^t)v_0 = E_\lambda E(XX^t v_0 \mid \lambda_l(X) = \lambda) = E_\lambda E(XX^t v_0 \mid X^t v_0 = \lambda) = E_\lambda E(\lambda X \mid X^t v_0 = \lambda) = E_\lambda(\lambda^2)\, v_0. \qquad \square$$

Principal components need not be self consistent in the sense of the definition; however, they are self consistent with respect to linear regression.

Proposition 2: Suppose $l(\lambda) = u_0 + \lambda v_0$ is a straight line, and we linearly regress the $p$ components $X_j$ of $X$ on the projection $\lambda_l(X)$, resulting in linear functions $f_j(\lambda)$. Then $f = l$ iff $v_0$ is an eigenvector of $\Sigma$ and $u_0 = 0$.

The proof of this requires only elementary linear algebra and is omitted.

3.1. A distance property of principal curves

An important property of principal components is that they are critical points of the distance from the observations. Let $d(x,f)$ denote the usual euclidean distance from a point $x$ to its "projection" on $f$: $d(x,f) = \|x - f(\lambda_f(x))\|$, and define $D^2(h,f) = E_h\, d^2(X,f)$.

Consider a straight line $l(\lambda) = u + \lambda v$. The distance $D^2(h,l)$ in this case may be regarded as a function of $u$ and $v$: $D^2(h,l) \equiv D^2(h,u,v)$. It is well known that $\mathrm{grad}_{u,v}\, D^2(h,u,v) = 0$ iff $u = 0$ and $v$ is an eigenvector of $\Sigma$, i.e., the line $l$ is a principal component line. We now restate this fact in a variational setting and extend it to principal curves.

Let $\mathcal{G}$ denote a class of curves parametrized over $\Lambda$. For $g \in \mathcal{G}$ define $f_t = f + tg$. This creates a perturbed version of $f$.

Definition: The curve $f$ is called a critical point of the distance function for variations in the class $\mathcal{G}$ iff
$$\left.\frac{dD^2(h, f_t)}{dt}\right|_{t=0} = 0 \qquad \forall g \in \mathcal{G}.$$

Proposition 3: Let $\mathcal{G}_l$ denote the class of straight lines $g(\lambda) = a + \lambda b$. A straight line $l_0(\lambda) = a_0 + \lambda b_0$ is a critical point of the distance function for variations in $\mathcal{G}_l$ iff $b_0$ is an eigenvector of $\mathrm{Cov}(X)$ and $a_0 = 0$.

The proof involves straightforward linear algebra and is omitted. A result analogous to proposition 3 holds for principal curves.

Proposition 4: Let $\mathcal{G}_B$ denote the class of smooth ($C^\infty$) curves $g$ parametrized over $\Lambda$, with $\|g\| \le 1$ and $\|g'\| \le 1$. Then $f$ is a principal curve of $h$ iff $f$ is a critical point of the distance function for perturbations in $\mathcal{G}_B$.

A proof of proposition 4 is given in the appendix. The condition that $\|g\|$ is bounded guarantees that $f_t$ lies in a thin tube around $f$ and that the tubes shrink uniformly as $t \to 0$. The boundedness of $\|g'\|$ ensures that for $t$ small enough, $f_t$ is well behaved and in particular $\|f_t'\|$ is bounded away from 0 for $t < 1$. Both conditions together guarantee that, for small enough $t$, $\lambda_{f_t}$ is well defined.

4. An algorithm for finding principal curves

In analogy to linear principal component analysis, we are particularly interested in finding smooth

curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve, such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve, and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a byproduct. We then check if the new curve is self consistent, and so on. If the self consistency condition is met, we have found a principal curve. It is easy to show that both the operations of projection and conditional expectation reduce the expected distance from the points to the curve.

4.1. The principal curve algorithm

The above discussion motivates the following iterative algorithm.

initialization: Set $f^{(0)}(\lambda) = \bar{x} + a\lambda$, where $a$ is the first linear principal component of $h$. Set $\lambda^{(0)}(x) = \lambda_{f^{(0)}}(x)$.

repeat: over iteration counter $j$
  1) Set $f^{(j)}(\cdot) = E(X \mid \lambda_{f^{(j-1)}}(X) = \cdot)$.
  2) Set $\lambda^{(j)}(x) = \lambda_{f^{(j)}}(x)\ \forall x \in \mathbb{R}^p$, and transform $\lambda^{(j)}$ so that $f^{(j)}$ is unit speed.
until: the change in $D^2(h, f^{(j)})$ is below some threshold.
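To make the alternating steps concrete, here is a minimal sketch of the iteration in Python under the finite-sample setting of section 5. All names are hypothetical: `smoother` stands for any scatterplot smoother such as the one described in section 5.2, and `project_to_polyline` for the projection step sketched in section 5.1. It is an illustration of the scheme, not the authors' implementation.

```python
import numpy as np

def principal_curve(X, smoother, project_to_polyline, max_iter=20, tol=1e-3):
    """Sketch of the principal curve iteration: initialize with the first
    principal component line, then alternate smoothing and projection."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    lam = (X - mu) @ Vt[0]              # lambda^(0): projections onto the line
    d2_old = np.inf
    for _ in range(max_iter):
        order = np.argsort(lam)         # sort points along the current curve
        # conditional expectation step: smooth each coordinate against lambda
        curve = np.column_stack(
            [smoother(lam[order], X[order, j]) for j in range(X.shape[1])])
        # projection step: new lambdas are arc lengths along the new polygon
        lam_sorted, d2 = project_to_polyline(X[order], curve)
        lam = np.empty_like(lam_sorted)
        lam[order] = lam_sorted         # back to the original observation order
        if d2_old - d2 < tol * d2:      # stop when the distance stabilizes
            break
        d2_old = d2
    return curve, lam
```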

There are potential problems with the above algorithm. While principal curves are by definition differentiable, there is no guarantee that the curves produced by the conditional expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in figure 4, where the expected values of the observations projecting onto $f(\lambda_{min})$ and $f(\lambda_{max})$ are disjoint from the new curve. If this situation occurs, we have to join $f(\lambda_{min})$ and $f(\lambda_{max})$ to the rest of the curve in a differentiable fashion. In light of the above, we cannot prove that the algorithm converges. All we have is some evidence in its favor:

· By definition principal curves are fixed points of the algorithm.

· Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance $D^2(h, f^{(j)})$ converges.

· If the conditional expectation operation in the principal curve algorithm is replaced by fitting a

least squares straight line, then the procedure converges to the largest principal component.

5. Principal curves for data sets

So far, we have considered principal curves of a multivariate probability distribution. In reality, however, we always work with finite multivariate data sets. Suppose then that $X$ is an $n \times p$ matrix of $n$ observations on $p$ variables. We regard the data set as a sample from an underlying probability distribution.

A curve $f(\lambda)$ is represented by $n$ tuples $(\lambda_i, f_i)$, joined up in increasing order of $\lambda$ to form a polygon. Clearly the geometric shape of the polygon depends only on the order, not on the actual values of the $\lambda_i$. We will assume that the tuples are sorted in increasing order of $\lambda$, and we use the arc-length parametrization, for which $\lambda_1 = 0$ and $\lambda_i$ is the arc length along the polygon from $f_1$ to $f_i$.
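For example, the arc-length parametrization of such a polygon is a one-liner (a small sketch; the vertices are assumed to be the rows of an n x p numpy array, already sorted):

```python
import numpy as np

def arc_length_parametrization(vertices):
    """lambda_1 = 0; lambda_i = arc length along the polygon from f_1 to f_i."""
    seg_lengths = np.linalg.norm(np.diff(vertices, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(seg_lengths)])
```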

As in the distribution case, the algorithm alternates between a projection step and an expectation step. In the absence of prior information we use the first linear principal component as a starting curve, with the $\lambda_i$ given by the projections of the $n$ observations onto it. The iterations stop when the relative change in the distance is below some threshold; the distance is estimated by adding up the squared distances of the points in the sample to their closest points on the current curve. We are unable to prove that the algorithm converges, nor that each step guarantees a decrease in the criterion. In practice we have had no convergence problems with more than 40 real and simulated examples.

5.1. The projection step

For fixed $f^{(j)}(\cdot)$ we wish to find for each $x_i$ in the sample the value $\lambda_i = \lambda_{f^{(j)}}(x_i)$. Define $d_{ik}$ as the distance between $x_i$ and its closest point on the line segment joining each pair $\big(f^{(j)}(\lambda_k^{(j)}),\, f^{(j)}(\lambda_{k+1}^{(j)})\big)$. Corresponding to each $d_{ik}$ is a value $\lambda_{ik} \in [\lambda_k^{(j)}, \lambda_{k+1}^{(j)}]$. We then set $\lambda_i$ to the $\lambda_{ik}$ corresponding to the smallest value of $d_{ik}$:
$$\lambda_i = \lambda_{ik^*} \quad \text{where} \quad d_{ik^*} = \min_k d_{ik}. \qquad (4)$$
Corresponding to each $\lambda_i$ is an interpolated point $f_i^{(j)}$; using these values to represent the curve, we replace $\lambda_i$ by the arc length from $f_1^{(j)}$ to $f_i^{(j)}$.
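A sketch of this projection step follows (this is the helper assumed by the algorithm sketch in section 4; the curve is stored as its ordered vertices, each point is tried against every segment, and the nearest one wins):

```python
import numpy as np

def project_to_polyline(X, vertices):
    """Project each row of X onto the polygon through `vertices`; return the
    arc-length parameter of each projection and the mean squared distance."""
    seg = np.diff(vertices, axis=0)                     # segment vectors
    seg_len = np.linalg.norm(seg, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])   # arc length at vertices
    lam = np.empty(len(X))
    total = 0.0
    for i, x in enumerate(X):
        # fractional position t_k of the closest point on each segment
        t = np.einsum('kj,kj->k', x - vertices[:-1], seg)
        t = np.clip(t / np.maximum(seg_len ** 2, 1e-12), 0.0, 1.0)
        proj = vertices[:-1] + t[:, None] * seg         # candidate projections
        d2 = ((x - proj) ** 2).sum(axis=1)              # the d_ik of equation (4)
        k = int(np.argmin(d2))                          # best segment
        lam[i] = cum[k] + t[k] * seg_len[k]
        total += d2[k]
    return lam, total / len(X)
```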

5.2. The conditional expectation step: scatterplot smoothing

The goal of this step is to estimate $f^{(j+1)}(\lambda) = E(X \mid \lambda_{f^{(j)}} = \lambda)$. We restrict ourselves to estimating this quantity at only $n$ values of $\lambda$, namely $\lambda_1, \ldots, \lambda_n$ found in the projection step above. A natural way of estimating $E(X \mid \lambda_{f^{(j)}} = \lambda_i)$ would be to gather all the observations that project onto $f^{(j)}$ at $\lambda_i$, and find their mean. Unfortunately there will in general be only one such observation, namely $x_i$. It is at this stage that we introduce the scatterplot smoother, a fundamental building block in the principal curve procedure for finite data sets. We estimate the conditional expectation at $\lambda_i$ by averaging all the observations $x_k$ in the sample for which $\lambda_k$ is close to $\lambda_i$. As long as these observations are close enough and the underlying conditional expectation is smooth, the bias introduced in approximating the conditional expectation will be small. On the other hand, the variance of the estimate decreases as we include more observations in the neighborhood.

5.2.1 Scatterplot smoothing

Local averaging is not a new idea. In the more common regression context, scatterplot smoothers use local averaging to estimate the regression function of a response against a covariate. Some commonly used smoothers are kernel smoothers (Watson 1964), spline smoothers (Silverman 1985), and the locally weighted running lines smoother of Cleveland (1979). These smooth a one dimensional response against a one dimensional covariate. In our case, the variable to be smoothed is $p$ dimensional, so we smooth each coordinate separately against $\lambda$. In our current implementation any scatterplot smoother can be used. We have experience with all those mentioned above, although all the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).

5.2.2 Locally weighted running lines smoother

Consider the estimation of E (x

IA), Le.

a single coordinate function, based on a sample of pairs

(At,XI),"', (An, x n ), and assume the Ai are ordered. To estimate E(x IA), the smoother fits a straight line to the wn observations {Xj} closest in Aj to Ai. The estimate is taken to be the fitted value of the line at Ai. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used, The weights are derived from a symmetric kernel centered at Ai that dies smoothly to zero within the neighborhood. Specifically, if hi is the distance to the wn'th nearest neighbor, then the points

s, in the neighborhood get weights Wij = (1

-I '\i.,xi r)3.
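A minimal sketch of this smoother follows (assuming distinct, real-valued lambda values; the weighted least squares line is computed from the closed-form normal equations):

```python
import numpy as np

def running_lines_smoother(lam, x, span=0.5):
    """Locally weighted running lines (after Cleveland 1979): for each lam[i],
    fit a tricube-weighted straight line to the span*n nearest neighbours and
    return the fitted value at lam[i]."""
    n = len(lam)
    k = max(2, int(np.ceil(span * n)))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(lam - lam[i])
        idx = np.argsort(d)[:k]                   # the w*n nearest neighbours
        h = max(d[idx].max(), 1e-12)              # distance to k-th neighbour
        w = (1.0 - (d[idx] / h) ** 3) ** 3        # tricube weights w_ij
        lw, xw, W = lam[idx], x[idx], None
        W = w.sum()
        lbar, xbar = (w * lw).sum() / W, (w * xw).sum() / W
        sxx = (w * (lw - lbar) ** 2).sum()
        slope = (w * (lw - lbar) * (xw - xbar)).sum() / sxx if sxx > 0 else 0.0
        fitted[i] = xbar + slope * (lam[i] - lbar)
    return fitted
```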

5.3. A demonstration of the algorithm

To illustrate the principal curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5\sin(\lambda) \\ 5\cos(\lambda) \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}, \qquad (5)$$
where $\lambda$ is uniformly distributed on $[0, 2\pi)$ and $e_1$ and $e_2$ are independent $N(0,1)$.
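Data from model (5) can be generated in a few lines (a sketch; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
lam = rng.uniform(0.0, 2.0 * np.pi, n)    # lambda ~ Uniform[0, 2*pi)
errors = rng.standard_normal((n, 2))      # e1, e2 independent N(0, 1)
X = np.column_stack([5.0 * np.sin(lam), 5.0 * np.cos(lam)]) + errors
```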

[Insert figures 5a-d about here]

Figures 5a-d show the data, the circle (dashed line) and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component, in figure 5a. Any line through the origin is a principal curve for the population model (5), but in general this will not be the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle.

The example is admittedly artificial, but it presents the principal curve procedure with a particularly tough job: the starting guess is wholly inappropriate. Each iteration smooths against the arc length of the projections of the points onto the previous curve, and then finds new parameter values by projecting the points onto the new curve. It can be seen that the ordering of the projected points along the new curve can be different to the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parametrized as a function of the first linear principal component.

5.4. Span selection for the scatterplot smoother

The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss here the choice of the span w for the locally weighted running line smoother.

A fixed span strategy. The common first guess for $f^{(0)}$ is the first principal component line. In most of the interesting situations, the final curve will not be a function of the arc length of the initial curve (see figure 6); it is reached by successively bending the original curve. We have found that if the initial span of the smoother is too small, the curve may bend too fast, and follow the data too closely. Our most successful strategy has been to initially use a large span, and then to decrease it gradually. In particular, we start with a span of $0.6n$ observations in each neighborhood, and let the algorithm converge (according to the criterion outlined above). We then drop the span to $0.5n$ and iterate till convergence. Finally the same is done at $0.4n$, by which time the procedure has found the general shape of the curve. The curves in figure 5 were found using this strategy. Spans of this magnitude have frequently been found appropriate for scatterplot smoothing in the regression context. In some applications, especially the two dimensional ones, we can plot the curve and the points and select a span that seems appropriate for the data. Other applications, such as the collider ring example in section 7, have special features that dictate the choice of span.

Automatic span selection by cross-validation. Suppose the procedure has converged to a self consistent (with respect to the smoother) curve for the span last used. If we reduce the span, the average distance decreases and the fitted curve follows the data more closely. We do not want the fitted curve to be too wiggly relative to the density of the data. The human eye is skilled at making tradeoffs between smoothness and fidelity to the data; we would like to make this judgment automatically. A similar situation arises in non-parametric regression, where we smooth a response $y$ against a covariate. A common method for making the smoothness judgment automatically is cross-validation (Stone 1974, Wahba and Wold 1975), which proceeds as follows. We predict each response $y_i$ in turn using a smooth estimated from the data with the $i$th observation omitted; let $\hat{y}_{(i)}$ be this predicted value, and define the cross-validated residual sum of squares
$$\mathrm{CVRSS} = \sum_{i=1}^n (y_i - \hat{y}_{(i)})^2.$$
$\mathrm{CVRSS}/n$ is an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.
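A sketch of this span search for a generic smoother follows (`fit_predict` is a hypothetical routine that fits the smoother on the training pairs and predicts at one new point):

```python
import numpy as np

def cvrss(lam, y, span, fit_predict):
    """Leave-one-out cross-validated residual sum of squares for one span."""
    n = len(lam)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i              # omit the i-th observation
        y_hat = fit_predict(lam[keep], y[keep], lam[i], span)
        total += (y[i] - y_hat) ** 2
    return total

# choose the span minimizing CVRSS over a grid, e.g.
# best = min((0.3, 0.4, 0.5, 0.6), key=lambda w: cvrss(lam, y, w, fit_predict))
```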

In the principal curve algorithm, we can use the same procedure for estimating the spans for each coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross-validation in this manner is trivial to implement.

Figure 7a shows the final curve after one more smoothing step, using cross-validation to select the span: nothing much has changed. Figure 7b, on the other hand, shows what happens if we continue iterating with the cross-validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the SLC example in section 7, this may be exactly what we want; it is unlikely, however, to be desirable in general, and in that event cross-validation would not be used to pick the span. A possible explanation for this behavior is that the errors in the coordinate functions are auto-correlated; cross-validation in this situation tends to pick spans that are too small (Hart and Wehrly, 1986).

5.5. Principal curves and splines

Our algorithm for estimating principal curves from samples is motivated by the algorithm for finding the principal curves of densities. This is analogous to the motivation for kernel smoothers and locally weighted running lines smoothers: they estimate a conditional expectation, a population quantity that minimizes a population criterion. They do not minimize a data dependent criterion. Smoothing splines, on the other hand, do. A cubic smoothing spline fitted to a set of pairs $(\lambda_1, x_1), \ldots, (\lambda_n, x_n)$ with penalty (smoothing parameter) $\mu$ minimizes
$$D^2(f) = \sum_{i=1}^n \big(x_i - f(\lambda_i)\big)^2 + \mu \int \big(f''(\lambda)\big)^2\, d\lambda. \qquad (6)$$
Analogously, one can pose the principal curve problem as minimizing
$$D^2(f, \lambda) = \sum_{i=1}^n \|x_i - f(\lambda_i)\|^2 + \mu \int_0^1 \|f''(\lambda)\|^2\, d\lambda \qquad (7)$$
over $f$ with $f_j \in S^2[0,1]$, $j = 1, \ldots, p$. Notice we have confined the functions to the unit interval, and thus do not use the unit speed parametrization. Intuitively, for a fixed smoothing parameter $\mu$, functions defined over an arbitrarily large interval can satisfy the second-derivative smoothness criterion and visit every point. It is easy to make this argument rigorous.

We now apply our alternating algorithm to this criterion:

· Given $f$, minimizing $D^2(f, \lambda)$ only involves the first part of (7) and is our usual projection step, with the $\lambda_i$ rescaled to lie in $[0,1]$.

· Given the $\lambda_i$, minimizing $D^2(f, \lambda)$ splits up into $p$ expressions of the form (6), one for each coordinate function. These are optimized by smoothing the $p$ coordinates against the $\lambda_i$ using a cubic spline smoother with parameter $\mu$.

The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm. An advantage of the spline smoothing algorithm is that it can be computed in $O(n)$ operations, and thus is a strong competitor for the kernel type smoothers, which take $O(n^2)$ unless approximations are used. Although it is difficult to guess the smoothing parameter $\mu$, alternative methods, such as using the approximate degrees of freedom (see Cleveland, 1979), are available for assessing the amount of smoothing and thus selecting the parameter. Our current implementation of the algorithm offers a choice of smoothing splines and locally weighted running lines; it is difficult to distinguish their performance in practice.
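As an illustration of the spline variant of the expectation step, the following sketch uses scipy's cubic smoothing splines as a stand-in for the penalized criterion (6); the parameter `s` plays the role of the smoothing parameter, and distinct lambda values are assumed:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_expectation_step(lam, X, s=None):
    """Smooth each of the p coordinates against lambda with a cubic smoothing
    spline, after rescaling lambda to the unit interval as in (7).
    Assumes the lambda values are distinct."""
    t = (lam - lam.min()) / (lam.max() - lam.min())  # confine lambda to [0, 1]
    order = np.argsort(t)                            # splines need increasing x
    fitted = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        spline = UnivariateSpline(t[order], X[order, j], k=3, s=s)
        fitted[:, j] = spline(t)                     # evaluate at all lambdas
    return fitted
```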

5.6. Further illustrations and discussion of the algorithm

The procedure worked well on the circle example and, not surprisingly, on a number of other artificial examples. Consider again a data set generated from the circle model (5). The circle is a principal curve, as are straight lines through the center, although the lines pass further from the observations than the circle. When the principal curve procedure is started at the circle, it does not move much, except at the endpoints, as depicted in figure 9a. This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 9b shows what happens when we use a periodic version of the smoother. However, starting from the linear principal component (where theoretically it should stay), the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle (see figure 9c); this behavior occurred repeatedly over a number of simulations of this example. The ends of the curve are "stuck", and further iterations do not free them. The example illustrates the fact that the algorithm will tend to find curves that are minima of the distance function. This is not surprising; after all, the principal curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits similar behavior: the power method will tend to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken. Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds the identical circle to that in figure 9b.

6. Bias considerations: model and estimation bias

Model bias occurs when the data are of the form $x = f(\lambda) + e$, and we wish to recover $f(\lambda)$. In general, if $f(\lambda)$ has curvature, it will not be a principal curve for the distribution it generates. As a consequence the principal curve procedure can only find a biased version of $f(\lambda)$, even if it is started at the generating curve. This bias goes to zero with the ratio of the noise variance to the radius of curvature.

Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.

6.1. A simple model for investigating bias

Assume the curve $f$ is an arc of a circle with center $O$ and radius $\rho$, and that the data $x$ are generated from a bivariate model: $\lambda$ is chosen uniformly on the arc, and $x = f(\lambda) + e$, where the error $e$ has mean 0 and is circularly symmetric. Intuitively, the points closest to a given point on the arc are those in a segment centered at the normal to the arc at that point: as the arc segment shrinks down to a point, the set of points projecting onto it shrinks down to the segment normal to the curve at that point. But there is always more mass outside the circle than inside; this implies that the conditional expectation lies outside the circle.

We can prove (Hastie, 1984) that the curve found by the procedure is a circle with radius
$$r_\theta = r^* \cdot \frac{\sin(\theta/2)}{\theta/2}, \qquad (8)$$
where $\theta$ is the angle subtended by the smoothing neighborhood and $r^* > \rho$ is the radius of the self-consistent circle for the model; $r^* \to \rho$ as $\sigma/\rho \to 0$.

Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations, and thus would not need local averaging to estimate conditional expectation, the circle with radius $\rho$ would not be a stationary point of the algorithm; the principal curve is a circle with radius $r^* > \rho$. The factor $\sin(\theta/2)/(\theta/2)$ is due to local averaging. There is clearly an optimal span at which the two bias components cancel exactly. In practice this is not much help, since knowledge of the radius of curvature and the error variance is needed to determine it, and typically these quantities will be changing as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.
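The flattening factor in (8) is easy to tabulate; for example, a smoothing neighborhood spanning 60 degrees of arc already shrinks the fitted radius by about 4.5 percent (a small numeric check of the factor only, not of $r^*$):

```python
import numpy as np

for deg in (15, 30, 60, 90):
    theta = np.deg2rad(deg)                       # angle spanned by the neighborhood
    factor = np.sin(theta / 2.0) / (theta / 2.0)  # estimation-bias factor in (8)
    print(f"theta = {deg:3d} deg:  r_theta / r* = {factor:.4f}")
```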

7. Examples

This section contains three examples that illustrate the use of the procedure.

7.1. The Stanford Linear Collider (SLC) project

This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC); the software was developed with Jerome Friedman of SLAC. The collider brings two intense beams of subatomic particles, an electron beam and a positron beam, to collision in a collision chamber; the particles emitted by the collision are recorded and studied by particle physicists, whose goal is to discover new sub-atomic particles. The linear accelerator is used to accelerate the electron and positron beams, and the two collider arcs bend these beams to bring them to collision.

Each of the two collider arcs contain roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beam. Ideally these magnets lie on a smooth curve with a circumference of about 3km, as depicted in the schematic. The collider has a third dimension, and actually resembles a floppy tennis racket: the tunnel containing the magnets goes underground, whereas the accelerator is above ground. Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0mm. A consequence of these errors was that the beam could not be adequately focused. The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that was smooth enough to allow focused bending of the beam. This strategy would hopefully reduce the amount of magnet movement necessary. The principal curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application.

Initial attempts at fitting curves used the data in the measured 3 dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing.

The theoretical arc was then "removed", and subsequent curve fitting was based on the residuals. This was achieved by replacing the 3 coordinates of each magnet by three new coordinates:

1) the arc length from the beginning of the arc till the point of projection onto the ideal curve (x),
2) the distance from the magnet to this projection in the horizontal plane (y), and
3) the distance in the vertical plane (z).

This technique effectively removed the major component of the bias, and is an illustration of how special situations lend themselves to adaptations of the basic procedure. Of course knowledge of the ideal curve is not available in other applications.

There was a natural way of choosing the span in this application. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at each magnet. The angle at each vertex is of vital importance, since the beam can be bent only a limited amount at each magnet without hitting the wall of the next segment: if $\theta_i$ measures the angle at magnet $i$, the engineers specify a threshold that no angle may exceed. No smoothing at all results in no magnet movement (no work), but with many magnets violating the threshold. As the amount of smoothing (span) is increased, the angles tend to decrease, and the residuals, and thus the amounts of magnet movement, increase. The strategy used was to increase the span until no magnets violated the angle constraint.

Figure 11b gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. It turned out that only 66% of the magnets needed to be moved, since the remaining 34% of the residuals were below 60 μm in length and considered negligible.

There were natural constraints on the system. Some of the magnets were fixed by design, and

thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets, and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11b has the parallel constraints built in at the endpoints. Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure: points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.

7.2. Gold assay pairs

A California based company collects computer chip waste in order to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the lot. The sample is split in two; one sub-sample is assayed by an outside laboratory, the other by their own inhouse laboratory. The company wishes to eventually use only one of the assays. It is in their interest to know which laboratory produces on average lower gold content assays for a given sample.

The data displayed in figure 12a consist of pairs of gold assays $x_i = (x_{1i}, x_{2i})$. Each point represents an observation, where $x_{1i}$ is the assay of the outside lab and $x_{2i}$ that of the inhouse lab for sample $i$; the assays are expressed on a log scale, where 1 corresponds to 1 oz of gold per ton. The log transformation produces a more even scatter, since there were many more small assays than larger ones (> 10 oz per ton). Our model for the above data is

$$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} f(\tau_i) \\ \tau_i \end{pmatrix} + \begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix}, \qquad (9)$$

where $\tau_i$ is the expected gold content for sample $i$ using the inhouse lab assay, $f(\tau_i)$ is the expected assay result for the outside lab relative to the inhouse lab, and the $e_{ji}$ are measurement errors, assumed i.i.d. with $\mathrm{Var}(e_{1i}) = \mathrm{Var}(e_{2i})\ \forall i$.

This is a generalization of the linear errors in variables model, or the structural model (if we regard the $\tau_i$ themselves as random variables), or the functional model (if the $\tau_i$ are considered fixed):
$$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} \alpha + \beta\tau_i \\ \tau_i \end{pmatrix} + \begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix}. \qquad (10)$$
Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component. Model (9) is a special case of the principal curve model, where one of the coordinate functions is the identity; this identifies the systematic component of variable $x_2$ with the arc-length parameter.

Similarly, we estimate (9) using a natural variant of the principal curve algorithm: in the smoothing step we smooth only $x_1$ against the current value of $\tau$, and then update $\tau$ by projecting the data onto the curve defined by $(f(\tau), \tau)$. The dashed curve in figure 12a is the usual scatterplot smooth of $x_1$ against $x_2$, and is clearly misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4 to 4, which means that in this interval the inhouse assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is hardly valuable.

A natural question arising at this point is whether the bend in the curve is real, or whether the procedure would find the same bend in data generated from the linear model (10). If we had access to more data we could simply calculate the curves for the additional samples and see. In the absence of additional samples, we use the bootstrap (Efron 1982) to simulate them. We computed the residual vectors of the points from the fitted curve. Since the fit is essentially an orthogonal projection onto a straight line or curve, the residual vectors underestimate the errors in (9); we therefore scale them up by a factor of $\sqrt{2}$. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the full curve fitting procedure was applied and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance.

Figure 12b shows the errors in variables principal curves obtained for 25 bootstrap samples. The spread of these curves gives an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible. Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear; we generated bootstrap samples in the same way as above, but replacing the principal curve by the linear errors in variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.
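A sketch of this residual bootstrap follows (hypothetical helpers: `fitted` holds the projections of the points onto the original fit, `residuals` the corresponding residual vectors, and `fit_curve` refits either the principal curve or the errors in variables line):

```python
import numpy as np

def bootstrap_curves(fitted, residuals, fit_curve, B=25, seed=0):
    """Resample scaled residual vectors, add them to the fitted values,
    and refit the curve B times."""
    rng = np.random.default_rng(seed)
    pool = np.sqrt(2.0) * residuals                  # scale up by sqrt(2)
    curves = []
    for _ in range(B):
        idx = rng.integers(0, len(pool), len(pool))  # sample with replacement
        curves.append(fit_curve(fitted + pool[idx])) # bootstrap replicate
    return curves
```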

7.3. One dimensional color data


The data in this example were used by Shepard and Carroll (1966) (who cite the original source as Boynton and Gordon (1965)) to illustrate a version of their parametric data representation techniques called proximity analysis. We give more details of this technique in the discussion.

The data set consists of 23 observations on four variables. It was obtained by exposing 100 observers to light of each of 23 different wavelengths. The four variables are the percentages of observers assigning the color descriptions blue, green, yellow and red to light of each wavelength. Figure 13a shows the projection of the data onto the plane spanned by the first two linear principal components. The observations lie very close to a smooth one dimensional curve, and the fitted principal curve provides an effective summary of the data set. (Note that this cannot be deduced from figure 13a alone.) Figure 13b shows the coordinate functions of the fitted curve; arc length along the curve turns out to be a monotone function of wavelength. An interesting feature of the curve is that the distance along it between the two extremes of the spectrum is smaller than the distance between an extreme and the middle: the two extremes of the spectrum are judged more similar to each other than to the middle.

8. Extension to higher dimensions - principal surfaces

We have had some success in extending the definitions and algorithms for curves to cater for two dimensional (globally parametrized) surfaces. A continuous 2-dimensional globally parametrized surface in $\mathbb{R}^p$ is a function $f: \Lambda \to \mathbb{R}^p$ for $\Lambda \subseteq \mathbb{R}^2$, where $f$ is a vector of continuous functions:
$$f(\lambda) = \begin{pmatrix} f_1(\lambda_1, \lambda_2) \\ f_2(\lambda_1, \lambda_2) \\ \vdots \\ f_p(\lambda_1, \lambda_2) \end{pmatrix}. \qquad (11)$$

[Insert figure 14 about here]

Let $X$ be defined as before, and let $f$ denote a smooth 2-dimensional surface in $\mathbb{R}^p$, parametrized over $\Lambda \subseteq \mathbb{R}^2$. Here the projection index $\lambda_f(x)$ is defined to be the parameter value corresponding to the point on the surface closest to $x$. The principal surfaces of $h$ are those smooth surfaces which are self consistent:
$$E(X \mid \lambda_f(X) = \lambda) = f(\lambda) \quad \text{for a.e. } \lambda.$$

Figure 14 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm similar to the curve algorithm; two dimensional surface smoothers are used instead of one dimensional scatterplot smoothers. We refer the reader to Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.

9. Discussion

Ours is not the first attempt at finding a method for fitting nonlinear manifolds to multivariate data. In discussing previous approaches to this problem we will restrict ourselves to one dimensional manifolds, the topic of this paper.

The approach closest to ours was suggested by Carroll (1969). He fits the model $x_i = p(\lambda_i) + e_i$, where $p(\lambda)$ is a vector of polynomials $p_j(\lambda) = \sum_{k=0}^{K_j} a_{jk}\lambda^k$ of prespecified degrees $K_j$. The goal is to find the coefficients of the polynomials and the $\lambda_i$, $i = 1 \ldots n$, minimizing the loss function $\sum_i \|e_i\|^2$. The algorithm makes use of the fact that for given $\lambda_1 \ldots \lambda_n$ the optimal polynomial coefficients can be found by linear least squares, and the loss function thus can be written as a function of the $\lambda_i$ only. Carroll was able to give an explicit formula for the gradient of the loss function, which is helpful in the $n$ dimensional numerical optimization required to find the optimal $\lambda$'s.

The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they use different goodness-of-fit measures. Their goal is to minimize the off-diagonal elements of the error covariance matrix $\Sigma_e = E^t E$, which is in the spirit of classical linear factor analysis; various measures for the size of the off-diagonal elements are suggested. Their algorithm is similar to ours in that it alternates between improving the $\lambda$'s for given polynomial coefficients, and finding the optimal polynomial coefficients for given $\lambda$'s. The latter is a linear least squares problem, whereas the former constitutes one step of a nonlinear optimization in $n$ parameters.

Shepard and Carroll (1966) proceed from the assumption that the $p$ dimensional observation vectors lie exactly on a smooth one dimensional manifold. In this case it is possible to find parameter values $\lambda_1 \ldots \lambda_n$ such that for each one of the $p$ coordinates, $x_{ij}$ varies smoothly with $\lambda_i$. The basis of their method is a measure for the degree of smoothness of the dependence of $x_{ij}$ on $\lambda_i$. This measure of smoothness, summed over the $p$ coordinates, is optimized with respect to the $\lambda$'s: one finds those values of $\lambda_1 \ldots \lambda_n$ which make the dependence of the coordinates on the $\lambda$'s as smooth as possible.

We will not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in section 5.5. This would lead to modeling the coordinate functions as spline functions, and should allow the method to better deal with noise in the data.

In view of this previous work, what do we think is the contribution of the present paper?

· From the operational point of view it is advantageous that there is no need to specify a parametric form for the coordinate functions, and that no $n$ dimensional numerical optimization is required to find the optimal $\lambda$'s. Because the curve is represented as a set of coordinate functions, plotting and interpretation is easy. This makes it attractive to apply principal curves to large data sets.

· Our definition of principal curves as conditional expectations agrees with the intuitive notion of a curve passing through the middle of the data, and makes principal curves a direct generalization of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.

Appendix - Proofs of Propositions.

Throughout the appendix we make the following assumptions:

Assumptions

Denote by $X$ a random vector in $\mathbb{R}^p$ with density $h$ and finite second moments. Let $f$ denote a smooth ($C^\infty$) unit speed curve in $\mathbb{R}^p$, parametrized over a closed, possibly infinite interval $\Lambda \subseteq \mathbb{R}^1$. We assume that $f$ does not intersect itself ($\lambda_1 \neq \lambda_2 \Rightarrow f(\lambda_1) \neq f(\lambda_2)$) and has finite length inside any finite ball.

Note: Under these conditions, the set $\{f(\lambda),\ \lambda \in \Lambda\}$ forms a smooth, connected 1-dimensional manifold diffeomorphic to the interval $\Lambda$. Any smooth, connected 1-dimensional manifold is diffeomorphic either to an interval or to a circle (Milnor, 1965). The results and proofs below could be slightly modified to cover the latter case (closed curves).

Existence of the projection index. Existence of the projection index is a consequence of the following two lemmas.

Lemma 5.1: For every $z \in \mathbb{R}^p$ and for any $r > 0$ the set $Q = \{\lambda \mid \|z - f(\lambda)\| \le r\}$ is compact.

Proof: $Q$ is closed, because $\|z - f(\lambda)\|$ is a continuous function of $\lambda$. It remains to show that $Q$ is bounded. Suppose it were not. Then there would exist an unbounded monotone sequence $\lambda_1, \lambda_2, \ldots$ with $\|z - f(\lambda_i)\| \le r$. Let $B$ denote the ball around $z$ with radius $2r$. Consider the segment of the curve between $f(\lambda_i)$ and $f(\lambda_{i+1})$. The segment either leaves and reenters $B$, or it stays entirely inside. This means that it contributes at least $\min(2r, |\lambda_{i+1} - \lambda_i|)$ to the length of the curve inside $B$. As there are infinitely many such segments, $f$ would have infinite length in $B$, which is a contradiction. $\square$

Lemma 5.2: For every $z \in \mathbb{R}^p$ there exists $\lambda \in \Lambda$ for which $\|z - f(\lambda)\| = \inf_{\mu \in \Lambda} \|z - f(\mu)\|$.

Proof: Define $r = \inf_{\mu \in \Lambda} \|z - f(\mu)\|$ and $B = \{\mu \mid \|z - f(\mu)\| \le 2r\}$. $B$ is non-empty and compact (lemma 5.1), and $\|z - f(\mu)\|$ is continuous, so the infimum is attained on $B$, and hence on $\Lambda$. $\square$

It follows that $\lambda_f$ is well defined: the set $\{\lambda \mid \|z - f(\lambda)\| = d(z, f)\}$ is non-empty (lemma 5.2) and compact (lemma 5.1), and therefore has a largest element. It is not hard to show that $\lambda_f$ is measurable; a proof is available upon request.

Stationarity of the distance function. We will first establish some simple facts which are of interest in themselves.

Lemma 6.1: If $f(\lambda_0)$ is a closest point to $x$ and $\lambda_0 \in \Lambda^\circ$, the interior of the parameter interval, then $x$ is in the normal hyperplane to $f$ at $f(\lambda_0)$:
$$\langle x - f(\lambda_0),\, f'(\lambda_0)\rangle = 0.$$
Proof: The squared distance $\|x - f(\lambda)\|^2$ has a local minimum at $\lambda_0$, and its derivative is defined ($\lambda_0$ is an interior point); the derivative, $-2\langle x - f(\lambda_0), f'(\lambda_0)\rangle$, must therefore vanish. $\square$

A point $x$ is called an ambiguity point for a curve $f$ if it has more than one closest point on the curve: $\mathrm{card}\{\lambda \mid \|x - f(\lambda)\| = d(x, f)\} > 1$. Let $\mathcal{A}$ denote the set of ambiguity points. Our next goal will be to show that $\mathcal{A}$ is measurable and has measure 0.

Define $M_\lambda$, the normal hyperplane to $f$ at $\lambda$, by
$$M_\lambda = \{x \mid \langle x - f(\lambda),\, f'(\lambda)\rangle = 0\}.$$
It will be useful to define a mapping that maps $\Lambda \times \mathbb{R}^{p-1}$ into $\bigcup_\lambda M_\lambda$. Choose $p-1$ smooth vector fields $n_1(\lambda), \ldots, n_{p-1}(\lambda)$ such that for every $\lambda$ the vectors $f'(\lambda), n_1(\lambda), \ldots, n_{p-1}(\lambda)$ are orthogonal. It is well known that such vector fields do exist. Define $\chi: \Lambda \times \mathbb{R}^{p-1} \to \mathbb{R}^p$ by
$$\chi(\lambda, v) = f(\lambda) + \sum_{i=1}^{p-1} v_i\, n_i(\lambda),$$
and set $M = \chi(\Lambda, \mathbb{R}^{p-1})$, the set of all points in $\mathbb{R}^p$ lying in some normal hyperplane of some point on the curve. The mapping $\chi$ is smooth, because $f$ and $n_1, \ldots, n_{p-1}$ are assumed to be smooth.

We shall now present a few observations which simplify showing that $\mathcal{A}$ is measurable and has measure 0.

Lemma 6.2: $P(\mathcal{A} \cap M^c) = 0$.

Proof: Suppose $x \in \mathcal{A} \cap M^c$. According to lemma 6.1, this is possible only if all the closest points of $x$ correspond to boundary points of the parameter interval, of which there are at most two. Such an ambiguity point satisfies $\|x - f(\lambda_{min})\| = \|x - f(\lambda_{max})\|$, and the set of points with this property forms a hyperplane, which has measure 0. $\square$

Lemma 6.3: Let $E$ be a measure 0 set. It is sufficient to show that for every $z \in \mathbb{R}^p \setminus E$ there exists a neighborhood $N(z)$ with $P(\mathcal{A} \cap N(z)) = 0$.

Proof: The open covering $\{N(z) \mid z \in \mathbb{R}^p \setminus E\}$ of $\mathbb{R}^p \setminus E$ contains a countable covering $\{N_i\}$, because the topology of $\mathbb{R}^p$ has a countable base; then $P(\mathcal{A}) \le P(E) + \sum_i P(\mathcal{A} \cap N_i) = 0$. $\square$

Lemma 6.4: We can restrict ourselves to the case of compact $\Lambda$.

Proof: Set $\Lambda_n = \Lambda \cap [-n, n]$, $f_n = f|\Lambda_n$, and let $\mathcal{A}_n$ be the set of ambiguity points of $f_n$. Suppose $z$ is an ambiguity point of $f$; then $\{\lambda \mid \|z - f(\lambda)\| = d(z, f)\}$ is compact (lemma 5.1). Therefore $z \in \mathcal{A}_n$ for some $n$, and $\mathcal{A} \subseteq \bigcup_n \mathcal{A}_n$. $\square$

We are now ready to prove:

Proposition 6: The set of ambiguity points has measure 0.

Proof: We can restrict ourselves to the case of compact $\Lambda$ (lemma 6.4). As $P(\mathcal{A} \cap M^c) = 0$ (lemma 6.2), it is sufficient to show that for every $z \in M$, with the possible exception of a set $C$ of measure 0, there exists a neighborhood $N(z)$ with $P(\mathcal{A} \cap N(z)) = 0$.

We choose $C$ to be the set of critical values of $\chi$. (A point $y \in M$ is called a regular value if $\mathrm{rank}(\chi'(z)) = p$ for all $z \in \chi^{-1}(y)$; otherwise $y$ is called a critical value.) By Sard's theorem (Milnor, 1965), $C$ has measure 0.

Pick $z \in M \cap C^c$. We will first show that $\chi^{-1}(z)$ is a finite set $\{(\lambda_1, v_1), \ldots, (\lambda_k, v_k)\}$. Suppose on the contrary that there was an infinite set $\{(\xi_1, w_1), (\xi_2, w_2), \ldots\}$ with $\chi(\xi_i, w_i) = z$. By compactness of $\Lambda$ and continuity of $\chi$, there would exist a cluster point $\xi_0$ of $\{\xi_1, \xi_2, \ldots\}$ and a corresponding $w_0$ with $\chi(\xi_0, w_0) = z$. On the other hand, $z$ was assumed to be a regular value of $\chi$, and thus $\chi$ would be a diffeomorphism between a neighborhood of $(\xi_0, w_0)$ and a neighborhood of $z$; this is a contradiction.

Because $z$ is a regular value, there are neighborhoods $L_i$ of $(\lambda_i, v_i)$ and a neighborhood $N(z)$ such that $\chi$ is a diffeomorphism between $L_i$ and $N$. Actually, a stronger statement holds. We can find $\tilde{N}(z) \subseteq N(z)$ for which $\chi^{-1}(\tilde{N}) \subseteq \bigcup_{i=1}^k L_i$.

On the other hand, z was assumed to be a regular value of x, and thus X would be a diffeomorphism between a neighborhood of (AO''WO) and a neighborhoodof.z-.This isa contradiction. Beca,l1se zlsa regular value, there are neighborhoods Li(Ai, Vi) and a neighborhood N(z) such that X is a diffeomorphism between L; and N. Actually, a stronger statement holds. We can find N(z) C N(z) for which X-leN)

c

U~ Lj. Suppose this were not the case. Then there would exist a sequence

Zb Z2,'"

-

z and

corresponding (';i,Wi) ~ U:=lLi with X(';i,'Wi) continuity x(.;o, 'Wa)

= Zi.

The set {';1,6,··-}has a cluster point.;o ~ UL;, and by

= z, which is a contradiction.

y E

exists exactly one pair (A;(y),V;(Y» E L; with X(A;(y),V;(Y»

= y, and A;(y) is a smooth function of y. Define Ao(Y) = Am;n, Ak+l(Y) = Amaz. Set d;(y) = lIy -- I(A;(y»1I 2 · A simple calculation using the chain rule and the fact that (y -- I(A;(y», 1'(A;(y))} = 0 (lemma 6.1) shows that

grad(d;(y»

= 2(y- I(Ai(Y»). A point y E N(z) can be an ambiguity point only if y E Aii for some i =f; j

= {':E

lai(.:)

where

=

curve 1(>') was assumed not to intersect

However. for

smootn,

P<:)SSlOlY

-- dj

::/: 0, because

not connected, manifold of dimension p -- 1, which has measure 0, and

Note: There is a technicality we have glossed over: Sard's theorem requires $\chi$ to be defined on an open set. However, we can always extend $f$ in a smooth way beyond the boundaries of the interval.

In the following, let $\mathcal{G}_B$ denote the class of smooth curves parametrized over $\Lambda$, with $\|g(\lambda)\| \le 1$ and $\|g'(\lambda)\| \le 1$. For $g \in \mathcal{G}_B$, define $f_t(\lambda) = f(\lambda) + t\,g(\lambda)$. It is easy to see that $f_t$ has finite length inside any finite ball, and that for $t < 1$, $\lambda_{f_t}$ is well defined. Moreover, we have

Lemma 4.1: If $x$ is not an ambiguity point for $f$, then $\lim_{t\to 0} \lambda_{f_t}(x) = \lambda_f(x)$.

Proof: We have to show that for every $\epsilon > 0$ there exists $\delta > 0$ such that for all $t < \delta$, $|\lambda_{f_t}(x) - \lambda_f(x)| < \epsilon$. Set $C = \Lambda \cap (\lambda_f(x) - \epsilon,\, \lambda_f(x) + \epsilon)^c$ and $d_c = \inf_{\lambda \in C} \|x - f(\lambda)\|$. The infimum is attained, and $d_c > \|x - f(\lambda_f(x))\|$, because $x$ is not an ambiguity point. Choose $\delta = (d_c - \|x - f(\lambda_f(x))\|)/3$. Then for $t < \delta$,
$$\inf_{\lambda \in C}\big(\|x - f_t(\lambda)\| - \|x - f_t(\lambda_f(x))\|\big) \ge d_c - \delta - \|x - f(\lambda_f(x))\| - \delta = \delta > 0,$$
so $\lambda_{f_t}(x)$ must lie in $(\lambda_f(x) - \epsilon,\, \lambda_f(x) + \epsilon)$. $\square$

We are now ready to prove:

Proposition 4: The curve $f$ is a principal curve of $h$ iff
$$\left.\frac{dD^2(h, f_t)}{dt}\right|_{t=0} = 0 \qquad \forall g \in \mathcal{G}_B.$$

Proof: We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression
$$\frac{d}{dt} D^2(h, f_t) = \frac{d}{dt}\, E\,\|X - f_t(\lambda_{f_t}(X))\|^2.$$
We need to find a pair of integrable random variables which almost surely bound
$$Z_t = \frac{\|X - f_t(\lambda_{f_t}(X))\|^2 - \|X - f(\lambda_f(X))\|^2}{t}$$
for all sufficiently small $t > 0$.

Now
$$\|X - f_t(\lambda_{f_t}(X))\|^2 \le \|X - f_t(\lambda_f(X))\|^2.$$
Expanding the first norm on the right we get
$$\|X - f_t(\lambda_f(X))\|^2 = \|X - f(\lambda_f(X))\|^2 - 2t\,\langle X - f(\lambda_f(X)),\, g(\lambda_f(X))\rangle + t^2\,\|g(\lambda_f(X))\|^2,$$
and thus
$$Z_t \le -2\langle X - f(\lambda_f(X)),\, g(\lambda_f(X))\rangle + t\,\|g(\lambda_f(X))\|^2. \qquad (13)$$
Using the Cauchy-Schwarz inequality and the assumption that $\|g\| \le 1$,
$$Z_t \le 2\,\|X - f(\lambda_f(X))\| + 1 \le 2\,\|X - f(\lambda_0)\| + 1$$
for all $t < 1$ and arbitrary $\lambda_0$. As $\|X\|$ was assumed to be integrable, so is this bound. Similarly,
$$\|X - f(\lambda_{f_t}(X))\|^2 \ge \|X - f(\lambda_f(X))\|^2,$$
and expanding the first norm we get
$$Z_t \ge -2\langle X - f(\lambda_{f_t}(X)),\, g(\lambda_{f_t}(X))\rangle + t\,\|g(\lambda_{f_t}(X))\|^2 \ge -2\,\|X - f(\lambda_{f_t}(X))\| \ge -2\,\|X - f(\lambda_0)\| - 4t, \qquad (14)$$

which is once again integrable. By the dominated convergence theorem, the interchange is justified. Moreover, from (13) and (14), and because $f$ and $g$ are continuous functions, we see that the limit $\lim_{t\to 0} Z_t$ exists whenever $\lambda_{f_t}(X)$ is continuous in $t$ at $t = 0$. We have proved this continuity for a.e. $x$ in lemma 4.1. By (13) and (14), this limit is given by
$$\lim_{t\to 0} Z_t = -2\langle X - f(\lambda_f(X)),\, g(\lambda_f(X))\rangle.$$
Denoting the distribution of $\lambda_f(X)$ by $h_\lambda$, we get
$$\left.\frac{dD^2(h, f_t)}{dt}\right|_{t=0} = -2\, E_{h_\lambda} \big\langle E(X \mid \lambda_f(X) = \lambda) - f(\lambda),\, g(\lambda)\big\rangle.$$
If $f(\lambda)$ is a principal curve of $h$, then by definition $E(X \mid \lambda_f(X) = \lambda) = f(\lambda)$ for a.e. $\lambda$, and thus $dD^2(h, f_t)/dt\,|_{t=0} = 0$ for all $g \in \mathcal{G}_B$. Conversely, suppose that $dD^2(h, f_t)/dt\,|_{t=0} = 0$ for all $g \in \mathcal{G}_B$. Considering each coordinate separately and choosing $g$ appropriately, this implies that $E(X \mid \lambda_f(X) = \lambda) = f(\lambda)$ for a.e. $\lambda$. $\square$

28

Construction of densities with known principal curves. Let $f$ be parametrized over a compact interval $\Lambda$. It is easy to construct densities with carrier in a tube around $f$, for which $f$ is a principal curve. Denote by $B_r$ the ball in $\mathbb{R}^{p-1}$ with radius $r$ and center at the origin. The construction is based on the following:

Proposition 7: If $\Lambda$ is compact, there exists $r > 0$ such that $\chi \mid \Lambda \times B_r$ is a diffeomorphism.

Proof: Suppose the result were not true. Pick a sequence $r_i \to 0$. There would exist sequences $(\lambda_i, v_i) \ne (\xi_i, w_i)$ with $\|v_i\| \le r_i$, $\|w_i\| \le r_i$ and $\chi(\lambda_i, v_i) = \chi(\xi_i, w_i)$. The sequences $\lambda_i$, $\xi_i$ have cluster points $\lambda_0$, $\xi_0$. We must have $\lambda_0 = \xi_0$, because
$$f(\xi_0) = \chi(\xi_0, 0) = \lim_{i \to \infty} \chi(\xi_i, w_i) = \lim_{i \to \infty} \chi(\lambda_i, v_i) = \chi(\lambda_0, 0) = f(\lambda_0),$$
and by assumption $f$ does not intersect itself. So there would be sequences $(\lambda_i, v_i)$ and $(\xi_i, w_i)$ converging to $(\lambda_0, 0)$ with $\chi(\lambda_i, v_i) = \chi(\xi_i, w_i)$. However, it is easy to see that $(\lambda_0, 0)$ is a regular point of $\chi$, and thus $\chi$ maps a neighborhood of $(\lambda_0, 0)$ diffeomorphically into a neighborhood of $f(\lambda_0)$, which is a contradiction. $\Box$

Define $T(f, r) = \chi(\Lambda \times B_r)$. Proposition 7 assures that there are no ambiguity points in $T(f, r)$: $\lambda_f(x) = \lambda$ for $x \in \chi(\lambda, B_r)$. Pick a density $\phi(\lambda)$ on $\Lambda$ and a density $\psi(v)$ on $B_r$ with $\int_{B_r} v\,\psi(v)\,dv = 0$. The mapping $\chi$ carries the product density $\phi(\lambda)\cdot\psi(v)$ on $\Lambda \times B_r$ into a density $h(x)$ on $T(f, r)$. It is easy to verify that $f$ is a principal curve for $h$.
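To make the construction concrete, here is a minimal sampling sketch (ours, not part of the original report) with $p = 2$, so that $B_r$ is the interval $(-r, r)$; $f$ is an arc of the unit circle, $\phi$ is uniform on $\Lambda$, and $\psi$ is uniform on $(-r, r)$. All function names are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
lam_min, lam_max, r = 0.0, 3.0, 0.3        # Lambda = [0, 3]; r below the
                                           # radius of curvature of f
def f(lam):
    # Arc of the unit circle: smooth, unit speed, non-self-intersecting.
    return np.column_stack([np.cos(lam), np.sin(lam)])

def normal(lam):
    # Unit normal to f (the radial direction, for a circle).
    return np.column_stack([np.cos(lam), np.sin(lam)])

n = 5000
lam = rng.uniform(lam_min, lam_max, n)     # phi: a density on Lambda
v = rng.uniform(-r, r, n)                  # psi: mean-zero density on B_r
x = f(lam) + v[:, None] * normal(lam)      # chi carries (lambda, v) into the tube

# Self-consistency check: inside the tube the projection index of x is
# lambda itself (proposition 7 rules out ambiguity points), so conditional
# means of x given lambda should recover f.
edges = np.linspace(lam_min, lam_max, 16)
centers = 0.5 * (edges[:-1] + edges[1:])
idx = np.digitize(lam, edges) - 1
cond_means = np.array([x[idx == j].mean(axis=0) for j in range(len(centers))])
print(np.abs(cond_means - f(centers)).max())   # small, up to binning error

Since $\psi$ has mean zero on each normal slice, $E(X \mid \lambda_f(X) = \lambda) = f(\lambda)$ holds by construction; the binned check above only verifies it empirically.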

References

Technical Report 389, Institute for Mathematical Studies in the Social Sciences, Stanford University, California.

Becker, R., Chambers, J. and Wilks, A. (1988), The New S Language, Wadsworth, New York.

Boynton, R.M. and Gordon, J. (1965), "Bezold-Brücke Hue Shift Measured by Color-Naming Technique", Journal of the Optical Society of America, 55.

Carroll, D.J. (1969), "Polynomial Factor Analysis", Proceedings of the 77th Annual Convention, APA, 103-104.

Cleveland, W.S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots", Journal of the American Statistical Association, 74, 829-836.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, SIAM-CBMS, 38.

Etezadi-Amoli, J. and McDonald, R.P. (1983), "A Second Generation Nonlinear Factor Analysis", Psychometrika, 48, 315-342.

Golub, G.H. and van Loan, C. (1979), "Total Least Squares", in Smoothing Techniques for Curve Estimation, Proceedings, Heidelberg, Springer-Verlag.

Hart, J. and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data", Journal of the American Statistical Association, 81, 1080-1088.

Hastie, T.J. (1984), "Principal Curves and Surfaces", Laboratory for Computational Statistics Technical Report 11 and Ph.D. thesis, Statistics Department, Stanford University.

Milnor, J.W. (1965), Topology from the Differentiable Viewpoint, University Press of Virginia, Charlottesville.

Shepard, R.N. and Carroll, D.J. (1966), "Parametric Representations of Non-Linear Data Structures", in Multivariate Analysis (Krishnaiah, P.R., ed.), Academic Press, New York.

Silverman, B.W. (1985), "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting", Journal of the Royal Statistical Society B, 47, 1-52.

Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions (with Discussion)", Journal of the Royal Statistical Society B, 36, 111-147.

Thorpe, J.A. (1979), Elementary Topics in Differential Geometry, Springer-Verlag, New York (Undergraduate Texts in Mathematics).

Wahba, G. and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-validation", Communications in Statistics, 4, 1-17.

Watson, G.S. (1964), "Smooth Regression Analysis", Sankhyā, Series A, 26, 359-372.

FIGURE 1. (a) The linear regression line minimizes the sum of squared deviations in the response variable. (b) The principal component line minimizes the sum of squared deviations in all the variables. (c) The smooth regression curve minimizes the sum of squared deviations in the response variable, subject to smoothness constraints. (d) The principal curve minimizes the sum of squared deviations in all the variables, subject to smoothness constraints.

FIGURE 2. The radius of curvature is the radius of the circle tangent to the curve with the same acceleration as the curve.

FIGURE 3. Each point on a principal curve is the average of the points that project there.

FIGURE 4. The mean of the observations projecting onto an endpoint of the curve can be disjoint from the rest of the curve.

FIGURE 5. Selected iterates of the principal curve procedure for the circle data. In all the figures we see the data, the circle from which the data is generated, and the current estimate produced by the algorithm. (a) The starting curve is the principal component line, with average squared distance $D^2(f^{(0)}) = 12.91$. (b) Iteration 2: $D^2(f^{(2)}) = 10.43$. (c) Iteration 4: $D^2(f^{(4)}) = 2.58$. (d) Final iteration 8: $D^2(f^{(8)}) = 1.55$.

FIGURE 6. These schematics emphasize the iterative nature of the algorithm. The curve of the first iteration is a function of $\lambda^{(0)}$, measured along the starting curve (a); the curve of the second iteration is a function of $\lambda^{(1)}$, measured along the curve of the first iteration (b).
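The iteration summarized in figures 5 and 6 can be sketched in code. The following is a minimal Python rendering (ours, not the report's implementation): the vertex-nearest projection and a fixed-span running mean stand in for the exact projection and the scatterplot smoother, and all names are hypothetical:

import numpy as np

def project_index(X, curve):
    # Projection step: arc-length position lambda of the nearest curve
    # vertex (a polygonal stand-in for exact projection onto the curve).
    d = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1)
    seg = np.r_[0.0, np.cumsum(np.sqrt((np.diff(curve, axis=0) ** 2).sum(-1)))]
    return seg[d.argmin(axis=1)]

def smooth(lam, X, span=0.2):
    # Conditional-expectation step: running mean of the points whose
    # projection indices fall within a window of arc length.
    order = np.argsort(lam)
    lam_s, X_s = lam[order], X[order]
    h = span * (lam_s[-1] - lam_s[0] + 1e-12)
    return np.array([X_s[np.abs(lam_s - t) <= h].mean(axis=0) for t in lam_s])

def avg_sq_dist(X, curve):
    # D^2: average squared distance from the points to the curve vertices.
    return ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1).min(axis=1).mean()

def principal_curve(X, iters=8):
    # Start from the first principal component line (figure 5a), then
    # alternate projection and local averaging (figure 6).
    u = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][0]
    t = (X - X.mean(0)) @ u
    curve = X.mean(0) + np.sort(t)[:, None] * u
    for _ in range(iters):
        lam = project_index(X, curve)   # lambda^{(j)} along f^{(j)}
        curve = smooth(lam, X)          # f^{(j+1)}: average of points
    return curve

# Circle data in the spirit of figure 5:
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)
X = 5 * np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 1, (100, 2))
fit = principal_curve(X)
print(avg_sq_dist(X, fit))   # decreases over the iterations, as in figure 5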

FIGURE 7. (a) The final curve in figure 6 with one more smoothing step, using cross-validation separately for each of the co-ordinates: $D^2(f^{(9)}) = 1.28$. (b) The curve obtained by continuing the iterations using cross-validation at every step.

FIGURE 8. The points represent three dimensional data generated from a helix with gaussian errors. The dashed curve is the generating helix, and the solid curve is the curve fitted to the data. All are projected onto a plane chosen to give a reasonable view of the results.

FIGURE 9. Some curves produced by the algorithm applied to bivariate spherical gaussian data. (a) The curve found when the algorithm is started at a ... or a ... centered at the mean; the use of a smoother ensures ... (c) The curve found when the starting ... is ...

FIGURE 10. The data is generated from ...; the location on ... is selected uniformly, with ... errors. The curve found is ... than the generating curve.

FIGURE 11. (a) A rough schematic of the Stanford linear accelerator and the linear collider ring. (b) The fitted co-ordinate for the positions for a section of the SLC. The data represent residuals from the theoretical curve. Some (35%) of the deviations from the fitted curve were small enough that these magnets were not moved.

FIGURE 12. (a) Plot of the log assays for the inhouse and outside labs. The solid curve is the principal curve, the dashed curve the scatterplot smooth. (b) A band of 25 bootstrap curves. Each curve is the principal curve of a bootstrap sample. A bootstrap sample is obtained by randomly assigning residuals to the principal curve (solid curve). The band of curves appears to be ...; the spread of the curves gives an indication of the variance. (c) Another band of 25 bootstrap curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors in variables regression line (solid line). This simulation is aimed at testing the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.

FIGURE 13. (a) The 4 dimensional color data and the principal curve projected onto the first principal component plane. (b) The estimated co-ordinate functions plotted against the arc length of the principal curve. The arc length is a monotone function of wavelength.

FIGURE 14. Each point on a principal surface is the average of the points that project there.
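The bootstrap described in the caption of figure 12(b) can be sketched as follows (ours, reusing `principal_curve` from the earlier sketch; the exact resampling details of the original analysis may differ):

import numpy as np

def bootstrap_band(X, curve, n_boot=25, seed=2):
    # Residual bootstrap around a fitted principal curve: residuals from
    # the curve are resampled and reattached at randomly chosen points of
    # the curve; the principal curve of each bootstrap sample is refitted.
    rng = np.random.default_rng(seed)
    near = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    resid = X - curve[near]                  # residual vectors from the fit
    band = []
    for _ in range(n_boot):
        anchors = curve[rng.integers(0, len(curve), len(X))]
        Xb = anchors + resid[rng.integers(0, len(resid), len(X))]
        band.append(principal_curve(Xb))     # refit on the bootstrap sample
    return band   # the spread of these curves indicates the variance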

[Figures 1-14 appear here as plates in the original report. Recoverable labels: circle-example panels "start: PC line (12.91)", "iteration 1 (10.43)", "iteration 4 (2.58)", "final (1.55)", "CV final fit (1.28)", "CV'd iteratively (.12)"; SLC schematic labels "collision chamber" and "linear accelerator", with magnet-position plots against arc length (m); gold-assay panels (12a-c) plotted against the inhouse assay; color-data panel (13b) plotted against estimated lambda (wavelength) along the first principal axis, with hue labels (red, yellow); figure 14 marks f(X) on a principal surface.]
