PRINCIPAL CURVES
by
Trevor Hastie Werner Stuetzle
TECHNICAL REPORT No. 151 Revised October 1988
Department of Statistics, GN22 University of Washington Seattle, Washington 98195 USA
AUTHORS' FOOTNOTE

Trevor Hastie is a Member of the Technical Staff at AT&T Bell Laboratories, Murray Hill, New Jersey 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, 98195. This work was developed for the most part at Stanford University with partial support from the Department of Energy under contracts DE-AC03-76SF and DE-AT03-81-ER10843, the Office of Naval Research contract ONR N00014-81-K-0340, and by the U.S. Army Research Office. The authors thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support; Robert Tibshirani, Brad Efron and Jerry Friedman for many helpful discussions and suggestions; Horst Friedsam and Will Oren for supplying the SLC example and their help with the analysis; and both referees for their constructive criticism of earlier drafts.
Abstract
Principal curves are smooth one dimensional curves that pass through the middle of a p dimensional data set, providing a nonlinear summary of the data. They are nonparametric, and their shape is suggested by the data. The algorithm for constructing principal curves starts with some prior summary, such as the usual principal component line. The curve in each successive iteration is a smooth or local average of the p dimensional points, where the definition of local is based on the distance in arc length of the projections of the points onto the curve found in the previous iteration. In this paper we define principal curves, give an algorithm for their construction, present some theoretical results, and compare our procedure to other generalizations of principal components. Two applications illustrate the use of principal curves. In the first application, two different assays for gold content in a number of samples of computer chip waste appear to show some systematic differences which are blurred by measurement error. The classical approach using linear errors in variables regression can detect systematic linear differences, but is not able to account for nonlinearities. When we replace the first linear principal component by a principal curve, a local "bump" is revealed, and we use bootstrapping to verify its presence. As a second example we describe how the principal curve procedure was used to align the magnets of the Stanford Linear Collider. The collider uses around 950 magnets in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction it was found that some of the magnets had ended up significantly out of place. As a result the beams had to be bent too sharply and could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations, but rather to a sufficiently smooth arc through the "middle" of the existing positions. This arc was found using the principal curve procedure.
Keywords: Principal Components, Smoother, Nonparametric, Nonlinear, Errors in Variables, Symmetric, Self-Consistency.
1. Introduction
Consider a data set consisting of n observations on two variables, x and y. We can represent the n points in a scatterplot, as in figure 1a. It is natural to try and summarize the pattern exhibited by the points in the scatterplot. The form of summary we choose depends on the goal of our analysis. A trivial summary is the mean vector, which simply locates the center of the cloud but conveys no information about the joint behavior of the two variables.

It is often sensible to treat one of the variables as a response variable, and the other as an explanatory variable. The aim of the analysis is then to seek a rule for predicting the response using the value of the explanatory variable. Standard linear regression produces a linear prediction rule: the expectation of y is modeled as a linear function of x and is usually estimated by least squares. This procedure is equivalent to finding the line that minimizes the sum of vertical squared deviations, as depicted in figure 1a.
In many situations we do not have a preferred variable that we wish to label response, but would
still like to summarize the joint behavior of x and y. The dashed line in figure 1a shows what happens if we used x as the response. So simply assigning the role of response to one of the variables could lead to a poor summary. An obvious alternative is to summarize the data by a straight line that treats the two variables symmetrically. The first principal component line in figure 1b does just this, by minimizing the orthogonal deviations.

Linear regression has been generalized to include nonlinear functions of x. This has been achieved using predefined parametric functions, and, with the reduced cost and increased speed of computing, nonparametric scatterplot smoothers have gained popularity. These include kernel smoothers (Watson 1964), nearest neighbor smoothers (Cleveland 1979), and spline smoothers (Silverman 1985). In general, scatterplot smoothers produce a curve that minimizes the vertical deviations, as depicted in figure 1c, subject to some form of smoothness constraint. The nonparametric versions referred to above allow the data to dictate the form of the nonlinear dependency.
In this paper we consider similar generalizations for the symmetric situation. Instead of summarizing the data with a straight line, we use a smooth curve; in finding the curve we treat the two variables symmetrically. Such curves pass through the middle of the data, whether or not the middle is a straight line. This situation is depicted in figure 1d. We call these curves principal curves.
Like linear principal components, principal curves focus on the orthogonal or shortest distance to the points. We formally define principal curves to be those smooth curves that are self consistent for a distribution or data set. This means that if we pick any point on the curve, collect all the data that "project" onto this point and average them, then this average coincides with the point on the curve. The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

The largest principal component line plays roles other than that of a data summary:

· In errors in variables regression it is assumed that there is randomness in the predictors as well as the response. This can occur in practice when the predictors are measurements of some underlying variables, and there is error in the measurements. It also occurs in observational studies where neither variable is fixed by design. The errors in variables regression technique models the expectation of y as a linear function of the systematic component of x. In the case of a single predictor, the model is estimated by the principal component line. This is also the total least squares method of Golub and van Loan (1979). More details are given in an example in section 8.

· Often we want to replace a number of highly correlated variables by a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.
· In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; in the one factor case one would use the largest principal component. Many variations of this model have appeared in the literature.

In all the above situations the model can be written as

x_i = u0 + a λ_i + e_i,   (1)

where u0 + a λ_i is the systematic component and e_i the random component; we assume E(e_i) = 0. The least squares estimate of the systematic component is given by the first principal component. A natural generalization of (1) is the nonlinear model

x_i = f(λ_i) + e_i.   (2)
This might then be a factor analysis or structural model, and, for two variables and some restrictions, an errors in variables regression model. In the same spirit as above, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2). This paper focuses on the definition of principal curves, and on an algorithm for finding them. We also present some theoretical results, although many open questions remain.
2. The principal curves of a probability distribution
We first give a brief introduction to one dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also discuss briefly an alternative approach via regularization using smoothing splines.
2.1. One dimensional curves
A one dimensional curve in p dimensional space is a vector f(λ) of p functions of a single variable λ. These functions are called the coordinate functions, and λ provides an ordering along the curve. If the coordinate functions are smooth, then f is by definition a smooth curve. We can apply any monotone transformation to λ, and by modifying the coordinate functions appropriately the curve remains unchanged. The parametrization, however, is different. There is a natural parametrization for curves in terms of the arc length. The arc length of a curve f from λ0 to λ1 is given by

l = ∫_{λ0}^{λ1} ||f'(z)|| dz.

If ||f'(z)|| ≡ 1 then l = λ1 − λ0. This is a rather desirable situation, since if all the coordinate variables are in the same units of measurement, then λ is also in those units. The vector f'(λ) is tangent to the curve at λ and is sometimes called the velocity vector at λ.
A curve with ||f'|| ≡ 1 is called a unit speed curve. We can reparametrize any smooth curve to make it unit speed; smoothness of the coordinate functions then relates more naturally to the smoothness of the curve itself, since it translates directly into a smooth visual appearance of the point set {f(λ), λ ∈ Λ} (absence of sharp bends).

If v is a unit vector, then f(λ) = u0 + λv is a unit speed straight line. This parametrization is not unique: f~(λ) = u0 + av + λv is another unit speed parametrization for the same line. In the following we will always assume that ⟨u0, v⟩ = 0.

The vector f''(λ) is called the acceleration of the curve at λ, and for a unit speed curve it is easy to check that it is orthogonal to the tangent vector. In this case f''/||f''|| is called the principal normal to the curve at λ. The two vectors f'(λ) and f''(λ) span a plane. There is a unique unit speed circle in the plane that goes through f(λ) and has the same velocity and acceleration at f(λ) as the curve itself. The radius r_f(λ) of this circle is called the radius of curvature of the curve f at λ; it is easy to see that r_f(λ) = 1/||f''(λ)||. The center c_f(λ) of the circle is called the center of curvature of f at λ.
Thorpe (1979) gives a clear introduction to these and related ideas in differential geometry.
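The arc length computations above are easy to carry out for a curve represented by a finite, ordered set of points, where the integral reduces to a cumulative sum of segment lengths. The following Python sketch is our own illustration (the function name is not from the original implementation):

```python
import numpy as np

def arc_length_param(points):
    """Cumulative arc length along an ordered (n, p) array of points:
    lambda_1 = 0, and lambda_i is the length of the polygon from the
    first point to the i-th point."""
    seg = np.diff(points, axis=0)                # segment vectors f_{i+1} - f_i
    seg_len = np.sqrt((seg ** 2).sum(axis=1))    # their euclidean lengths
    return np.concatenate([[0.0], np.cumsum(seg_len)])

# A finely discretized quarter circle of radius 2 has arc length
# r * theta = 2 * (pi/2) = pi.
t = np.linspace(0, np.pi / 2, 1000)
curve = np.column_stack([2 * np.cos(t), 2 * np.sin(t)])
lam = arc_length_param(curve)
```

Parametrizing the polygon by `lam` is the discrete analogue of the unit speed parametrization: equal increments in `lam` correspond to equal distances along the curve.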
[Insert figure 3 around here]

2.2. Definition of principal curves
Denote by X a random vector in R^p with density h and finite second moments. Without loss of generality assume E(X) = 0. Let f denote a smooth (C∞) unit speed curve in R^p parametrized over Λ ⊆ R¹, a closed (possibly infinite) interval, which does not intersect itself (λ1 ≠ λ2 ⇒ f(λ1) ≠ f(λ2)) and has finite length inside any finite ball in R^p. We define the projection index λ_f: R^p → R¹ by

λ_f(x) = sup{λ : ||x − f(λ)|| = inf_μ ||x − f(μ)||}.   (3)

The projection index λ_f(x) of x is the value of λ for which f(λ) is closest to x. If there are several such values, we pick the largest one. We show in the appendix that λ_f(·) is well defined and measurable.
Definition: The curve f is called self consistent or a principal curve of h if E(X | λ_f(X) = λ) = f(λ) for a.e. λ.

This definition is the basis for our algorithms to come, and figure 3 illustrates the intuitive motivation behind it. For any particular parameter value λ, we collect all the observations that have f(λ) as their closest point on the curve. If f(λ) is the average of those observations, and if this holds for all λ, then f is called a principal curve. In practice we average over observations projecting into a neighborhood on the curve, which is analogous to using scatterplot smoothers to estimate conditional expectations.
The definition of principal curves immediately gives rise to a number of interesting questions: for what kinds of distributions do principal curves exist, how many principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous: there are densities which do have principal curves. It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two dimensional spherically symmetric distribution, a circle with center at the origin and radius E||X|| is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. See, however, our note at the beginning of the appendix.) We show in the appendix that for compact Λ it is always possible to construct densities with carrier in a thin tube around f, which have f as a principal curve.
What about data generated from the model X = f(λ) + ε, with f smooth and E(ε) = 0? Is f a principal curve for this distribution? The answer seems to be no in general. We will show in section 7, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of x, given λ(x) = λ0, lies outside the circle of curvature at λ0; this implies that f cannot be a principal curve. So in this situation the principal curve will be biased for the functional model. We have some evidence that this bias is small, and that it decreases to zero as the variance of the errors gets small relative to the radius of curvature. We discuss this bias, as well as estimation bias (which fortunately appears to operate in the opposite direction), in section 7.
3. Connections between principal curves and principal components
In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1: If a straight line l(λ) = u0 + λv0 is self consistent, then it is a principal component.

Proof: The line has to pass through the origin, because

0 = E(X) = E_λ E(X | λ_l(X) = λ) = E_λ(u0 + λ v0) = u0 + E_λ(λ) v0.

Since we assumed ⟨u0, v0⟩ = 0, this implies u0 = 0 and E_λ(λ) = 0. It remains to show that v0 is an eigenvector of the covariance of X:

Σ v0 = E(X X^t) v0
     = E_λ E(X X^t v0 | λ_l(X) = λ)
     = E_λ E(X X^t v0 | X^t v0 = λ)
     = E_λ E(λ X | X^t v0 = λ)
     = E_λ(λ²) v0,

using the self consistency E(X | X^t v0 = λ) = λ v0 in the last step. □
Principal components need not be self consistent in the sense of the definition; however, they are self consistent with respect to linear regression.

Proposition 2: Suppose l(λ) = u0 + λv0 is a straight line, and we linearly regress the p components X_j of X on the projection λ_l(X), resulting in linear functions f_j(λ). Then f = l iff v0 is an eigenvector of Σ and u0 = 0.

The proof of this requires only elementary linear algebra and is omitted.
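Proposition 1 is easy to check numerically for an ellipsoidal distribution: binned conditional means of X, given its projection onto the first principal component line, fall back on that line. A small simulation (our illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[3.0, 1.2], [1.2, 1.0]])      # mean-zero Gaussian, p = 2
X = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

w, V = np.linalg.eigh(np.cov(X.T))
v0 = V[:, np.argmax(w)]                       # first principal component direction

lam = X @ v0                                  # projection index onto l(lambda) = lambda * v0
resid = X - np.outer(lam, v0)                 # component orthogonal to the line

# Self consistency: within bins of lambda, the average of X should lie on
# the line, so the orthogonal residual should average to roughly zero.
edges = np.quantile(lam, np.linspace(0, 1, 11)[1:-1])
bins = np.digitize(lam, edges)
bin_means = np.array([resid[bins == b].mean(axis=0) for b in range(10)])
```

For a Gaussian, E(X | X^t v0 = λ) = λ Σv0 / (v0^t Σ v0), which equals λ v0 exactly when v0 is an eigenvector, so the binned orthogonal residuals hover around zero up to sampling noise.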
3.1. A distance property of principal curves

An important property of principal components is that they are critical points of the distance from the observations. Let d(x, f) denote the usual euclidean distance from a point x to its "projection" on f: d(x, f) = ||x − f(λ_f(x))||, and define D²(h, f) = E_h d²(X, f).

Consider a straight line l(λ) = u + λv. The distance D²(h, l) in this case may be regarded as a function of u and v: D²(h, l) = D²(h, u, v). It is well known that grad_{u,v} D²(h, u, v) = 0 iff u = 0 and v is an eigenvector of Σ, i.e., the line l is a principal component line. We now restate this fact in a variational setting and extend it to principal curves. Let G denote a class of curves parametrized over Λ. For g ∈ G define f_t = f + tg. This creates a perturbed version of f.

Definition: The curve f is called a critical point of the distance function for variations in the class G iff

dD²(h, f_t)/dt |_{t=0} = 0 for all g ∈ G.

Proposition 3: Let G_l denote the class of straight lines g(λ) = a + λb. A straight line l0(λ) = a0 + λb0 is a critical point of the distance function for variations in G_l iff b0 is an eigenvector of Cov(X) and a0 = 0.

The proof involves straightforward linear algebra and is omitted. A result analogous to proposition 3 holds for principal curves.

Proposition 4: Let G_B denote the class of smooth (C∞) curves g parametrized over Λ, with ||g|| ≤ 1 and ||g'|| ≤ 1. Then f is a principal curve of h iff f is a critical point of the distance function for perturbations in G_B.

A proof of proposition 4 is given in the appendix. The condition that ||g|| is bounded guarantees that f_t lies in a thin tube around f, and that the tubes shrink uniformly as t → 0. The boundedness of ||g'|| ensures that, for t small enough, f_t' is well behaved and in particular is bounded away from 0 for t < 1. Both conditions together guarantee that, for small enough t, λ_{f_t} is well defined.

4. An algorithm for finding principal curves
In analogy to linear principal component analysis, we are particularly interested in finding smooth
curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve, such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve, and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a byproduct. We then check if the new curve is self consistent, and so on. If the self consistency condition is met, we have found a principal curve. It is easy to show that both the operations of projection and conditional expectation reduce the expected distance from the points to the curve.
4.1. The principal curve algorithm
The above discussion motivates the following iterative algorithm.

initialization: Set f^(0)(λ) = E(X) + aλ, where a is the first linear principal component of h. Set λ^(0)(x) = λ_{f^(0)}(x).

repeat: over iteration counter j
  1) Set f^(j)(·) = E(X | λ_{f^(j−1)}(X) = ·), the conditional expectation curve.
  2) Set λ^(j)(x) = λ_{f^(j)}(x) for all x in R^p.

until: the change in D²(h, f^(j)) is below some threshold.
There are potential problems with the above algorithm. While principal curves are by definition differentiable, there is no guarantee that the curves produced by the conditional expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in figure 4, where the expected values of the observations projecting onto f(λ_min) and f(λ_max) are disjoint from the new curve. If this situation occurs, we have to join f(λ_min) and f(λ_max) to the rest of the curve in a differentiable fashion. In light of the above, we cannot prove that the algorithm converges. All we have is some evidence in its favor:
· By definition principal curves are fixed points of the algorithm.
· Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance D²(h, f^(j)) converges.
· If the conditional expectation operation in the principal curve algorithm is replaced by fitting a
least squares straight line, then the procedure converges to the largest principal component.
5. Principal curves for data sets
So far, we have considered principal curves of a multivariate probability distribution. In reality, however, we always work with finite multivariate data sets. Suppose then that X is an n × p matrix of n observations on p variables. We regard the data set as a sample from an underlying probability distribution. A curve f(λ) is represented by n tuples (λ_i, f_i), joined up in increasing order of λ to form a polygon. Clearly the geometric shape of the polygon depends only on the order, not on the actual values of the λ_i. We will assume that the tuples are sorted in increasing order of λ, and we use the arc length parametrization, for which λ_1 = 0 and λ_i is the arc length along the polygon from f_1 to f_i.
As in the distribution case, the algorithm alternates between a projection step and an expectation step. In the absence of prior information we use the first linear principal component as a starting curve; the initial λ's are the projections of the n observations onto it. The algorithm iterates until the change in the estimated distance is below some threshold. The distance is estimated by adding up the squared distances of the points in the sample to their closest points on the current curve. We are unable to prove that the algorithm converges, nor that each step guarantees a decrease in the criterion. In practice we have had no convergence problems with more than 40 real and simulated examples.
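The alternating algorithm can be sketched in a few dozen lines. The sketch below is our own minimal Python illustration under simplifying assumptions: a plain k-nearest-neighbor running mean stands in for the locally weighted running lines smoother, the span is held fixed, and none of the refinements discussed below are included.

```python
import numpy as np

def project_to_polyline(X, curve, lam):
    """Projection step: for each row of X, find the arc length parameter of
    its closest point on the polygon through `curve`, and the squared distance."""
    best_l = np.zeros(len(X))
    best_d = np.full(len(X), np.inf)
    for k in range(len(curve) - 1):
        a, b = curve[k], curve[k + 1]
        ab = b - a
        denom = ab @ ab
        t = np.clip((X - a) @ ab / denom, 0.0, 1.0) if denom > 0 else np.zeros(len(X))
        proj = a + t[:, None] * ab
        d = ((X - proj) ** 2).sum(axis=1)
        upd = d < best_d
        best_d[upd] = d[upd]
        best_l[upd] = lam[k] + t[upd] * np.sqrt(denom)
    return best_l, best_d

def knn_smooth(lam, X, span):
    """Conditional expectation step: estimate E(X | lambda = lam_i) by
    averaging the span*n observations whose lambdas are nearest lam_i."""
    n = len(lam)
    k = max(2, int(span * n))
    out = np.empty_like(X, dtype=float)
    for i in range(n):
        idx = np.argsort(np.abs(lam - lam[i]))[:k]
        out[i] = X[idx].mean(axis=0)
    return out

def principal_curve(X, n_iter=8, span=0.2):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]                                   # project onto first PC line
    d2_init = ((Xc - np.outer(lam, Vt[0])) ** 2).sum(axis=1).mean()
    order = np.argsort(lam)
    lam_s = lam[order]
    for _ in range(n_iter):
        f = knn_smooth(lam_s, X[order], span)          # expectation step
        seg = np.sqrt((np.diff(f, axis=0) ** 2).sum(axis=1))
        flam = np.concatenate([[0.0], np.cumsum(seg)]) # arc length parameters
        lam, d2 = project_to_polyline(X, f, flam)      # projection step
        order = np.argsort(lam)
        lam_s = lam[order]
    return f, d2_init, d2.mean()

# Noisy sine wave: the fitted curve should do much better than the line.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 2 * np.pi, 300))
X = np.column_stack([t, np.sin(t)]) + rng.normal(0, 0.1, (300, 2))
f, d2_line, d2_curve = principal_curve(X)
```

The average squared distance after the iterations is substantially below the distance to the first principal component line, which is the behavior the convergence criterion monitors.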
5.1. The projection step
For fixed f^(j)(·) we wish to find for each x_i in the sample the value λ_i = λ_{f^(j)}(x_i). Define d_{ik} as the distance between x_i and its closest point on the line segment joining each pair (f^(j)(λ_k^(j)), f^(j)(λ_{k+1}^(j))). Corresponding to each d_{ik} is a value λ_{ik} ∈ [λ_k^(j), λ_{k+1}^(j)]. We then set λ_i to the λ_{ik} corresponding to the smallest value of d_{ik}:

λ_i = λ_{ik*}, where k* = argmin_k d_{ik}.   (4)

Corresponding to each λ_i is an interpolated point f_i^(j) on the polygon; using these (λ_i, f_i^(j)) tuples to represent the curve, we replace λ_i by the arc length from f_1^(j) to f_i^(j).
5.2. The conditional expectation step: scatterplot smoothing
The goal of this step is to estimate f^(j+1)(λ) = E(X | λ_{f^(j)} = λ). We restrict ourselves to estimating this quantity at only n values of λ, namely the λ_1, …, λ_n found in the projection step above. A natural way of estimating E(X | λ_{f^(j)} = λ_i) would be to gather all the observations that project onto f^(j) at λ_i, and find their mean. Unfortunately there will in general be only one such observation, namely x_i. It is at this stage that we introduce the scatterplot smoother, a fundamental building block in the principal curve procedure for finite data sets. We estimate the conditional expectation at λ_i by averaging all the observations x_k in the sample for which λ_k is close to λ_i. As long as these observations are close enough and the underlying conditional expectation is smooth, the bias introduced in approximating the conditional expectation will be small. On the other hand, the variance of the estimate decreases as we include more observations in the neighborhood.

5.2.1 Scatterplot smoothing
Local averaging is not a new idea. In the more common regression context, scatterplot smoothers are used to estimate the regression function E(y | x) by local averaging. Some commonly used smoothers are kernel smoothers, nearest neighbor smoothers, and the locally weighted running lines smoother of Cleveland (1979). These smooth a one dimensional response against a one dimensional covariate. In our case the variable to be smoothed is p dimensional, so we smooth each coordinate function separately against λ. Our current implementation allows any scatterplot smoother to be used. We have experience with all those mentioned above, although all the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).
5.2.2 Locally weighted running lines smoother
Consider the estimation of E(x | λ), i.e. a single coordinate function, based on a sample of pairs (λ_1, x_1), …, (λ_n, x_n), and assume the λ_i are ordered. To estimate E(x | λ_i), the smoother fits a straight line to the wn observations x_j closest in λ_j to λ_i. The estimate is taken to be the fitted value of the line at λ_i. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used. The weights are derived from a symmetric kernel centered at λ_i that dies smoothly to zero within the neighborhood. Specifically, if h_i is the distance to the wn'th nearest neighbor, then the points λ_j in the neighborhood get the tricube weights w_{ij} = (1 − (|λ_i − λ_j| / h_i)³)³.
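A single fitted value of this smoother can be computed as follows (a minimal Python sketch of the description above; not Cleveland's implementation):

```python
import numpy as np

def running_line_fit(lam, x, i, span=0.3):
    """Locally weighted running line estimate of E(x | lam) at lam[i]."""
    n = len(lam)
    k = max(3, int(np.ceil(span * n)))        # wn observations in the neighborhood
    d = np.abs(lam - lam[i])
    idx = np.argsort(d)[:k]                   # the wn nearest neighbors in lam
    h = d[idx].max()                          # distance to the wn-th nearest neighbor
    w = (1 - (d[idx] / h) ** 3) ** 3          # tricube weights, zero at the boundary
    A = np.column_stack([np.ones(k), lam[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ x[idx])  # weighted least squares line
    return beta[0] + beta[1] * lam[i]         # fitted value at lam[i]

# On exactly linear data the smoother reproduces the line: 2 * 5 + 1 = 11.
lam = np.linspace(0.0, 10.0, 101)
x = 2.0 * lam + 1.0
est = running_line_fit(lam, x, 50)
```

Because the estimate is a weighted least squares fit, linear structure in the data is preserved exactly, which is the main advantage of running lines over running means near the endpoints.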
5.3. A demonstration of the algorithm
To illustrate the principal curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:
x = (x_1, x_2)^t = (5 sin(λ), 5 cos(λ))^t + (e_1, e_2)^t,   (5)

where λ is uniformly distributed on [0, 2π) and e_1 and e_2 are independent N(0, 1).
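In code, the simulation model (5) is simply (our illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
lam = rng.uniform(0, 2 * np.pi, n)            # lambda uniform on [0, 2*pi)
errors = rng.normal(0.0, 1.0, (n, 2))         # independent N(0, 1) in each coordinate
X = np.column_stack([5 * np.sin(lam), 5 * np.cos(lam)]) + errors
```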
[Insert figures 5a–d about here]
Figures 5a–d show the data, the circle (dashed line) and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component, in figure 5a. Any line through the origin is a principal curve for the population model (5), but in general this will not be the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle.
The example is admittedly artificial, but presents the principal curve procedure with a particularly tough job: the starting guess is wholly inappropriate, and the projections of the points onto this curve do not nearly represent their ordering around the circle. Each iteration projects the points onto the curve found in the previous step. It can be seen that the ordering of the projected points along the new curve can be very different from the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parametrized as a function of the first linear principal component.

5.4. Span selection for the scatterplot smoother
The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss here the choice of the span w for the locally weighted running line smoother.
A fixed span strategy. In most of the interesting situations, the final curve will not be a function of the arc length of the initial curve (see figure 6); it is reached by successively bending the original curve. We have found that if the initial span of the smoother is too small, the curve may bend too fast, and follow the data too closely. Our most successful strategy has been to initially use a large span, and then to decrease it gradually. In particular, we start with a span of 0.6n observations in each neighborhood, and let the algorithm converge (according to the criterion outlined above). We then drop the span to 0.5n and iterate till convergence. Finally the same is done at 0.4n, by which time the procedure has found a good approximation to the final curve. The curves in figure 5 were found using this strategy. Spans of this magnitude have frequently been found appropriate for scatterplot smoothing in the regression context. In some applications, especially the two dimensional ones, we can plot the curve and the points and select a span that seems appropriate for the data. Other applications, such as the collider ring example in section 8, have a natural criterion for the amount of smoothing.
Automatic span selection by cross validation. Assume the procedure has converged to a self consistent (with respect to the smoother) curve for the span last used. If we reduce the span, the average distance decreases, and the fitted curve follows the data more closely; we do not want the fitted curve to be too wiggly relative to the density of the data. The human eye is skilled at making tradeoffs between smoothness and fidelity to the data, but we would like to make this judgment automatically. A similar situation arises in nonparametric regression, where we smooth a response y against a covariate x. A common way of making the smoothness judgment automatic is to use cross validation, which proceeds as follows.
We predict the response y_i using a smooth estimated from the data with the ith observation removed; let ŷ_(i) be this predicted value. The cross validated residual sum of squares is

CVRSS = Σ_{i=1}^n (y_i − ŷ_(i))².

CVRSS/n is an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.
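The span selection rule can be sketched as follows. For brevity this illustration scores a k-nearest-neighbor running mean rather than the running lines smoother; the names are ours.

```python
import numpy as np

def cvrss(lam, y, span):
    """Leave-one-out cross validated residual sum of squares for a
    k-nearest-neighbor running mean smoother at the given span."""
    n = len(lam)
    k = max(2, int(span * n))
    rss = 0.0
    for i in range(n):
        d = np.abs(lam - lam[i])
        d[i] = np.inf                          # remove the i-th observation
        idx = np.argpartition(d, k)[:k]        # its k nearest neighbors
        rss += (y[i] - y[idx].mean()) ** 2
    return rss

# Noisy sine: very large spans oversmooth, so their CVRSS is worst.
rng = np.random.default_rng(3)
lam = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(lam) + rng.normal(0, 0.3, 200)
spans = [0.05, 0.1, 0.2, 0.4, 0.6]
scores = [cvrss(lam, y, s) for s in spans]
best = spans[int(np.argmin(scores))]
```

The chosen span balances the bias of oversmoothing against the variance of undersmoothing, exactly the tradeoff described above.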
In the principal curve algorithm, we can use the same procedure for estimating the spans for each
coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross validation in this manner is trivial to implement.
Figure 7a shows the final curve after one more smoothing step, using cross validation to select the span; nothing much has changed. Figure 7b, on the other hand, shows what happens if we continue iterating with the cross validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the SLC example in section 8, this may be exactly what we want; in that event, however, cross validation would not be used to pick the span. A possible explanation for this behavior is that the errors in the coordinate functions are autocorrelated; cross validation in this situation tends to pick spans that are too small (Hart and Wehrly, 1986).
5.5. Principal curves and splines
Our algorithm for estimating principal curves from samples is motivated by the algorithm for principal curves of distributions. This is analogous to the motivation for kernel smoothers and locally weighted running lines smoothers: they estimate a conditional expectation, a population quantity that minimizes a population criterion. They do not minimize a data criterion. Smoothing splines, on the other hand, do. A cubic smoothing spline for a set of pairs (λ_1, x_1), …, (λ_n, x_n) with penalty (smoothing parameter) p minimizes

D²(f) = Σ_{i=1}^n (x_i − f(λ_i))² + p ∫ (f''(λ))² dλ.   (6)

The analogue for curves is to minimize

D²(f, λ) = Σ_{i=1}^n ||x_i − f(λ_i)||² + p ∫_0^1 ||f''(λ)||² dλ   (7)

over curves f with coordinate functions f_j ∈ S_2[0, 1] and over λ_1, …, λ_n ∈ [0, 1]. Notice we have confined the functions to the unit interval, and thus do not use the unit speed parametrization. Intuitively, for a fixed smoothing parameter p, functions defined over an arbitrarily large interval can satisfy the second derivative smoothness criterion and visit every point; it is easy to make this argument rigorous. We now apply our alternating algorithm to this criterion:

· Given f, minimizing D²(f, λ) over the λ_i only involves the first part of (7), and is our usual projection step, with the λ_i rescaled to lie in [0, 1].

· Given the λ_i, (7) splits up into p expressions of the form (6), one for each coordinate function. These are optimized by smoothing the p coordinates against the λ_i, using a cubic spline smoother with parameter p.

The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm. An advantage of the spline smoothing algorithm is that it can be computed in O(n) operations, and thus is a strong competitor for the kernel type smoothers, which take O(n²) unless approximations are used. Although it is difficult to guess the smoothing parameter p, alternative methods, such as using the approximate degrees of freedom (see Cleveland, 1979), are available for assessing the amount of smoothing and thus selecting the parameter. Our current implementation of the algorithm offers a choice of smoothing splines or locally weighted running lines; it is difficult to distinguish their performance in practice.
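A discrete analogue of criterion (6) replaces the integral by a sum of squared second differences; the minimizer then solves a linear system. This is our own simplified illustration, not a cubic spline smoother:

```python
import numpy as np

def penalized_fit(x, p):
    """Minimize sum_i (x_i - f_i)^2 + p * sum_i (f_{i-1} - 2 f_i + f_{i+1})^2,
    a discrete analogue of the smoothing spline criterion (6), for one
    coordinate sampled at equally spaced lambda values."""
    n = len(x)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]       # second difference operator
    # Setting the gradient to zero gives (I + p D^T D) f = x.
    return np.linalg.solve(np.eye(n) + p * D.T @ D, x)

rng = np.random.default_rng(5)
lam = np.linspace(0, 2 * np.pi, 200)
truth = np.sin(lam)
x = truth + rng.normal(0, 0.3, 200)
f = penalized_fit(x, 50.0)
```

As p grows the fit approaches a straight line (the null space of the second difference operator); as p approaches 0 it interpolates the data, mirroring the role of the span for local averaging smoothers.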
5.6. Further illustrations and discussion of the algorithm
The procedure worked, not surprisingly, on the circle example, and on a number of other artificial examples. Consider again a data set generated from a circle as in (5). The circle is a principal curve for the generating distribution, as are lines through the center; the circle, however, fits the observations far better. When the principal curve procedure is started at the circle, it does not move much, except at the endpoints, as depicted in figure 9a. This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 9b shows what happens when we use a periodic version of the smoother. However, starting from the linear principal component (where theoretically it should stay), the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle (see figure 9c); this behavior occurred repeatedly over a number of simulations of this example. The ends of the curve are "stuck" and further iterations do not free them. The example illustrates the fact that the algorithm will tend to find curves that are minima of the distance function. This is not surprising; after all, the principal curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits similar behavior: the power method will tend to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken. Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds the identical circle to that in figure 9b.

6. Bias considerations: model and estimation bias

Model bias occurs when the data are of the form x = f(λ) + e, and we wish to recover f(λ). In general, if f(λ) has curvature, it will not be a principal curve for the distribution it generates; as a consequence, the principal curve procedure can only find a biased version of f(λ), even if it operates on the distribution itself. This bias goes to zero with the ratio of the noise variance to the radius of curvature. Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.
6.1. A simple model for investigating bias
Suppose the generating curve f is an arc of a circle with radius ρ, and points x are generated by choosing a mean uniformly on the arc and adding a bivariate spherical gaussian error with standard deviation σ. Intuitively, it seems that the circle cannot be self-consistent in this situation: consider the set of points that project onto a small arc segment. As the segment shrinks down to a point, this set approaches the line normal to the curve at that point, but there is always more probability mass outside the circle than inside, which implies that the conditional expectation lies outside the circle.
We can prove (Hastie, 1984) that the points projecting onto an arc segment spanning angle θ have conditional expectation on a circle with radius

r_θ = r* · sin(θ/2) / (θ/2),    (8)

where r* > ρ, and r* → ρ as σ/ρ → 0. Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations, and thus would not need local averaging to estimate conditional expectations, the circle with radius ρ would not be a stationary point of the algorithm; the principal curve is a circle with radius r* > ρ. This is the model bias. The factor sin(θ/2)/(θ/2) < 1 is due to local averaging. There is clearly an optimal span at which the two bias
components cancel exactly. In practice this is not much help, since knowledge of the radius of curvature and the error variance is required to determine it. Typically these quantities will be changing as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.
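The model bias is easy to check numerically. The sketch below (illustrative code, not from the paper) generates data around a circle of radius ρ = 1 and verifies that the expected distance of the points from the center exceeds ρ, so that the conditional expectation indeed lies outside the circle.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, n = 1.0, 0.3, 200_000

# Choose means uniformly on the circle, add spherical gaussian errors.
theta = rng.uniform(0, 2 * np.pi, n)
pts = rho * np.column_stack([np.cos(theta), np.sin(theta)])
pts += rng.normal(0, sigma, (n, 2))

# Radial distance of each point from the center.  Its expectation exceeds
# rho, because there is more probability mass outside the circle than
# inside: the circle of radius rho is not self-consistent.
r = np.linalg.norm(pts, axis=1)
print(r.mean())   # noticeably larger than rho
```

The gap between the sample mean of r and ρ shrinks as σ/ρ → 0, in agreement with the behavior of r* described above.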
7. Examples
This section contains three examples that illustrate the use of the procedure.
7.1. The Stanford Linear Collider (SLC) project
This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC), in software developed with Jerome Friedman of SLAC.
The collider accelerates an electron and a positron beam and collides them in a collision chamber. The intense collisions are recorded and studied by particle physicists, whose goal is to discover new subatomic particles. The collider arcs guide the two beams from the linear accelerator to the collision chamber.
Each of the two collider arcs contains roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beams. Ideally these magnets lie on a smooth curve with a circumference of about 3 km, as depicted in the schematic. The collider has a third dimension, and actually resembles a floppy tennis racket: the tunnel containing the magnets goes underground, whereas the accelerator is above ground. Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0 mm. A consequence of these errors was that the beam could not be adequately focused. The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that is smooth enough to allow focused bending of the beam. This strategy would hopefully reduce the amount of magnet movement necessary. The principal curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application. Initial attempts at fitting curves used the data in the measured 3-dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing.
The theoretical arc was then "removed", and subsequent curve fitting was based on the residuals. This was achieved by replacing the 3 coordinates of each magnet by three new coordinates:
1) the arc length from the beginning of the arc till the point of projection onto the ideal curve (x),
2) the distance from the magnet to this projection in the horizontal plane (y), and 3) the distance in the vertical plane (z). This technique effectively removed the major component of the bias, and is an illustration of how special situations lend themselves to adaptations of the basic procedure. Of course, knowledge of the ideal curve is not available in other applications; here it also provided a natural way of choosing the amount of smoothing. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at each magnet. The angle at each vertex is of vital importance, since it determines how sharply the beam is bent there; if θ_i measures the bending angle at magnet i, the engineers specify a threshold that no θ_i may exceed without the beam hitting the wall of the beam pipe. No smoothing at all results in no magnet movement (no work), but with many magnets violating the threshold. As the
amount of smoothing (span) is increased, the angles tend to decrease, while the residuals, and thus the amounts of magnet movement, increase. The strategy used was to increase the span until no magnets violated the angle constraint. Figure 11b gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. It turned out that only 66% of the magnets needed to be moved; the remaining 34% of the residuals were below 60 μm in length and considered negligible. There are natural constraints on the system. Some of the magnets were fixed by design, and thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly, there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets, and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11b has the parallel constraints built in at the endpoints. Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure. Points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.
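The engineers' strategy of increasing the span until the bending-angle constraint is met can be sketched as follows. Everything here is hypothetical: the function names, the crude moving-average stand-in for the scatterplot smoother, and the threshold are illustrations, not the actual SLC software.

```python
import numpy as np

def bending_angles(curve):
    """Angle (radians) at each interior vertex of the polygon through the
    fitted magnet positions -- the quantity the engineers thresholded."""
    a, b, c = curve[:-2], curve[1:-1], curve[2:]
    u, v = b - a, c - b
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def moving_average(y, span):
    """Crude stand-in for the scatterplot smoother used in the algorithm;
    edge-replicated padding avoids spurious endpoint angles."""
    pad = span // 2
    yp = np.pad(y, ((pad, pad), (0, 0)), mode="edge")
    k = np.ones(span) / span
    out = np.column_stack(
        [np.convolve(yp[:, j], k, mode="valid") for j in range(y.shape[1])])
    return out[: len(y)]

def smallest_adequate_span(points, threshold, spans):
    """Increase the span until no vertex angle violates the threshold."""
    for span in spans:
        fitted = moving_average(points, span)
        if bending_angles(fitted).max() <= threshold:
            return span, fitted
    return None, None
```

As in the text, small spans leave the fitted polygon rough (angle violations, little movement), and larger spans trade angle for residual size.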
7.2. Gold assay pairs
A California-based company collects computer chip waste in order to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the lot. The sample is split in two: one subsample is assayed by an outside laboratory, the other by their own in-house laboratory. The company wishes to eventually use only one of the assays. It is in their interest to know which laboratory produces on average lower gold content assays for a given sample.
Figure 12a shows the data: each point represents an observation x_i = (x_1i, x_2i)', the log assays of sample i for the outside and in-house labs. Logarithms were used to produce a more even scatter (there were many more small assays than larger ones (> 10 oz per ton)). Our model for the above data is
(x_1i, x_2i)' = (f(τ_i), τ_i)' + (e_1i, e_2i)',    (9)

where τ_i is the expected gold content for sample i using the in-house lab assay, f(τ_i) is the expected assay result for the outside lab relative to the in-house lab, and e_i = (e_1i, e_2i)' is measurement error, assumed i.i.d. with Var(e_1i) = Var(e_2i) ∀ i.
This is a generalization of the linear errors in variables model, also known as the structural model (if we regard the τ_i as random variables) or the functional model (if the τ_i are considered fixed):

(x_1i, x_2i)' = (α + βτ_i, τ_i)' + (e_1i, e_2i)'.    (10)

Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component. Model (9) is a special case of the principal curve model, where one of the coordinate functions is the identity. This identifies the systematic component of variable x_2 with the arc length parameter.
Similarly, we estimate model (9) using a natural variant of the principal curve algorithm. In the smoothing step we smooth only x_1 against the current value of τ, and then update τ by projecting the data onto the curve defined by (f(τ), τ). The dashed curve in figure 12a is the usual scatterplot smooth of x_1 against x_2, and is clearly
misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4 to 4, which contains the bulk of the observations; in this interval the in-house assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is hardly valuable.
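The errors-in-variables variant of the algorithm, smoothing only x_1 against τ and re-projecting, can be sketched as follows. This is an illustrative reimplementation, not the authors' software; the projection step is approximated by discretizing the curve on a fine grid, and scipy's `UnivariateSpline` stands in for the smoother.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def eiv_principal_curve(x1, x2, n_iter=5, s=None):
    """Fit model (x1, x2) = (f(tau), tau) + error by alternating:
    smooth x1 against the current tau (estimating f), then update tau by
    projecting each observation onto the curve (f(t), t)."""
    tau = x2.copy()                                  # natural starting value
    grid = np.linspace(x2.min(), x2.max(), 500)
    for _ in range(n_iter):
        order = np.argsort(tau)
        t = tau[order] + 1e-9 * np.arange(len(tau))  # enforce increasing abscissae
        f = UnivariateSpline(t, x1[order], k=3, s=s)
        fg = f(grid)
        # squared distance from each point to each grid point on the curve
        d2 = (x1[:, None] - fg[None, :])**2 + (x2[:, None] - grid[None, :])**2
        tau = grid[np.argmin(d2, axis=1)]
    return f, tau
```

Note how the identity coordinate function is never smoothed: only f is estimated, exactly as in the variant described above.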
A natural question arising at this point is whether the bend in the curve is real, or whether the linear model (10) would suffice. If we had access to more data, we could simply compute the principal curves for additional samples and see how they vary. In the absence of additional samples, we use the bootstrap to simulate from model (9). The residual vectors from the original fit underestimate the errors in model (9), since projection onto the curve leaves essentially only the component of the error orthogonal to the curve. We therefore scale them up by a factor of √2. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the full curve fitting procedure was applied, and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance. Figure 12b shows the errors-in-variables principal curves obtained for 25 bootstrap samples. The spread of these curves gives an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible. Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear; we generate bootstrap samples in the same way as above, but replace the principal curve by the linear errors-in-variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.
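The residual-resampling bootstrap can be sketched as follows. This is illustrative: `fit_curve` is a hypothetical callable standing in for the full principal curve (or errors-in-variables line) fitting procedure.

```python
import numpy as np

def bootstrap_curves(fitted, residuals, fit_curve,
                     n_boot=25, scale=np.sqrt(2), seed=0):
    """Residual bootstrap for a fitted principal curve.
    fitted    : (n, p) fitted values of the original fit
    residuals : (n, p) residual vectors; scaled up by sqrt(2) as in the text
    fit_curve : callable mapping an (n, p) data set to its fitted values
    Returns a list of n_boot bootstrapped fits."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_boot):
        # sample residual vectors with replacement from the pool
        idx = rng.integers(0, len(residuals), len(residuals))
        star = fitted + scale * residuals[idx]      # bootstrap replicate
        curves.append(fit_curve(star))
    return curves
```

The spread of the returned fits estimates the variance of the curve, and the difference between their average and the original fit estimates the bias, as described above.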
7.3. One dimensional color data
The data in this example were used by Shepard and Carroll (1966) (who cite the original source as Boynton and Gordon (1965)) to illustrate a version of their parametric data representation techniques called proximity analysis. We give more details of this technique in the discussion.
The data set consists of 23 observations on four variables. It was obtained by exposing 100 observers to light of each of 23 different wavelengths. The four variables are the percentages of observers assigning the color descriptions blue, green, yellow, or red to the stimulus. Figure 13a shows the projection of the data and of the principal curve onto the plane spanned by the first two linear principal components. The observations lie very close to this plane, and the principal curve provides an effective one-dimensional summary of the four-dimensional data set. (Note that this cannot be deduced from figure 13a alone.) Figure 13b shows the coordinate functions of the curve. Arc length along the curve turns out to be a monotone function of wavelength. An interesting feature of the curve is that the distance between the two extremes of the spectrum is smaller than the distance between either extreme and the middle; the two extremes of the spectrum are judged more similar to each other than to intermediate wavelengths.
8. Extension to higher dimensions  principal surfaces
We have had some success in extending the definitions and algorithms for curves to cater for two-dimensional (globally parametrized) surfaces. A continuous 2-dimensional globally parametrized surface in ℝ^p is a function f: Λ → ℝ^p for Λ ⊆ ℝ², where f is a vector of continuous functions:

f(λ_1, λ_2) = (f_1(λ_1, λ_2), f_2(λ_1, λ_2), ..., f_p(λ_1, λ_2))'.    (11)
[Insert figure 14 about here]
Let X be defined as before, and let f denote a smooth 2-dimensional surface in ℝ^p, parametrized over Λ ⊆ ℝ². Here the projection index λ_f(x) is defined to be the parameter value corresponding to the point on the surface closest to x. The principal surfaces of h are those surfaces which are self-consistent:

E(X | λ_f(X) = λ) = f(λ) for a.e. λ.
Figure 14 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm similar to the curve algorithm; two-dimensional surface smoothers are used instead of one-dimensional scatterplot smoothers. We refer the reader to Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.
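The conditional-expectation step of the surface algorithm can be sketched with a very simple two-dimensional smoother. Here a k-nearest-neighbor average over the parameter plane is an assumed stand-in for the surface smoothers actually used; it is not the thesis implementation.

```python
import numpy as np

def surface_smooth(X, lam, k=20):
    """Conditional-expectation step for a principal surface: for each
    point, average the observations whose 2-d parameters lam = (lam1, lam2)
    are among its k nearest neighbors in the parameter plane."""
    d2 = ((lam[:, None, :] - lam[None, :, :])**2).sum(-1)   # (n, n) distances
    nbrs = np.argsort(d2, axis=1)[:, :k]                    # k nearest in parameter space
    return X[nbrs].mean(axis=1)
```

As with curves, this step would alternate with a projection step that maps each observation to the parameters of its closest point on the fitted surface.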
9. Discussion
Ours is not the first attempt at finding a method for fitting nonlinear manifolds to multivariate data. In discussing previous approaches to this problem, we will restrict ourselves to one-dimensional manifolds, the topic of this paper. The approach closest to ours was suggested by Carroll (1969), who fits the model x_i = p(λ_i) + e_i, where p(λ) is a vector of polynomials p_j(λ) = Σ_k a_jk λ^k of prespecified degrees K_j. The goal is to find the coefficients of the polynomials and the λ_i, i = 1, ..., n, minimizing the loss function Σ ||e_i||².
Carroll's algorithm makes use of the fact that for given λ_1, ..., λ_n the optimal polynomial coefficients can be found by linear least squares, and the loss function thus can be written as a function of the λ_i only. Carroll was able to give an explicit formula for the gradient of the loss function, which is helpful in the n-dimensional numerical optimization required to find the optimal λ's.
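Carroll's observation, that for fixed λ's the coefficients solve a linear least squares problem, can be sketched as follows (illustrative code, not Carroll's):

```python
import numpy as np

def fit_polynomial_coefs(X, lam, degree=3):
    """Given fixed lam, the loss sum_i ||x_i - p(lam_i)||^2 is linear in the
    polynomial coefficients, so each coordinate is an ordinary least
    squares fit on the Vandermonde basis of lam."""
    V = np.vander(lam, degree + 1)            # columns lam^degree, ..., lam, 1
    coefs, *_ = np.linalg.lstsq(V, X, rcond=None)
    return coefs, V @ coefs                   # coefficients and fitted values
```

The outer optimization over the λ's themselves is the n-dimensional nonlinear problem referred to in the text; only the inner step is shown here.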
The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they use different goodness-of-fit measures. Their goal is to minimize the off-diagonal elements of the error covariance matrix Σ = E'E, which is in the spirit of classical linear factor analysis. Various measures for the off-diagonal elements are suggested, for example the sum of their squares. Their algorithm is similar to ours in that it alternates between improving the λ's for given polynomial coefficients, and finding the optimal polynomial coefficients for given λ's. The latter is a linear least squares problem, whereas the former constitutes one step of a nonlinear optimization in n parameters. Shepard and Carroll (1966) proceed from the assumption that the p-dimensional observation vectors lie exactly on a smooth one-dimensional manifold. In this case it is possible to find parameter values λ_1, ..., λ_n such that for each one of the p coordinates, x_ij varies smoothly with λ_i. The basis of their method is a measure for the degree of smoothness of the dependence of x_ij on λ_i. This measure of smoothness, summed over the p coordinates, is optimized with respect to the λ's: one finds those values of λ_1, ..., λ_n which make the dependence of the coordinates on the λ's as smooth as possible.
We will not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in section 5.5, which would lead to modeling the coordinate functions as spline functions, and should allow the method to better deal with noise in the data.
In view of this previous work, what do we think is the contribution of the present paper?
· From the operational point of view, it is advantageous that there is no need to specify a parametric form for the coordinate functions, and that the optimal λ's are found by the algorithm itself rather than by an n-dimensional numerical optimization. Because the curve is represented as a function of arc length, plotting and interpreting the coordinate functions is easy. These features make it attractive to fit principal curves to data sets.
· The definition of principal curves as conditional expectations agrees with the intuitive notion of a curve passing through the middle of the data, and is a natural generalization of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.
Appendix  Proofs of Propositions.
Throughout the appendix we make the following assumptions:
Assumptions
Denote by X a random vector in ℝ^p with density h and finite second moments. Let f denote a smooth (C∞) unit speed curve in ℝ^p, parametrized over a closed, possibly infinite interval Λ ⊆ ℝ¹. We assume that f does not intersect itself (λ_1 ≠ λ_2 ⇒ f(λ_1) ≠ f(λ_2)) and has finite length inside any finite ball.
Note: Under these conditions, the set {f(λ), λ ∈ Λ} forms a smooth, connected 1-dimensional manifold diffeomorphic to the interval Λ. Any smooth, connected 1-dimensional manifold is diffeomorphic either to an interval or to a circle (Milnor, 1965). The results and proofs below could be slightly modified to cover the latter case (closed curves).
Existence of the projection index. Existence of the projection index is a consequence of the following two lemmas.
Lemma 5.1: For every x ∈ ℝ^p and for any r > 0 the set Q = {λ | ||x − f(λ)|| ≤ r} is compact.
Proof: Q is closed, because ||x − f(λ)|| is a continuous function of λ. It remains to show that Q is bounded. Suppose it were not. Then there would exist an unbounded monotone sequence λ_1, λ_2, ... with ||x − f(λ_i)|| ≤ r. Let B denote the ball around x with radius 2r. Consider the segments of the curve between f(λ_i) and f(λ_{i+1}). Each segment either leaves and re-enters B, or stays entirely inside; in either case it contributes at least min(2r, |λ_{i+1} − λ_i|) to the length of the curve inside B. As there are infinitely many such segments, f would have infinite length inside B, which is a contradiction. □
Lemma 5.2: For every x ∈ ℝ^p there exists λ ∈ Λ for which

||x − f(λ)|| = inf_{μ∈Λ} ||x − f(μ)||.

Proof: Set B = {μ | ||x − f(μ)|| ≤ d(x, f) + 1}, where d(x, f) denotes the infimum on the right-hand side. Since B is nonempty and compact (lemma 5.1), and ||x − f(μ)|| is continuous in μ, the infimum is attained. □

We define the projection index by λ_f(x) = sup{λ | ||x − f(λ)|| = d(x, f)}. The set {λ | ||x − f(λ)|| = d(x, f)} is nonempty (lemma 5.2) and compact (lemma 5.1), and therefore has a largest element. It is not hard to show that λ_f is measurable; a proof is available upon request.
Stationarity of the distance function. We will first establish some simple facts which are of interest in themselves.

Lemma 6.1: If f(λ_0) is a closest point to x and λ_0 ∈ Λ°, the interior of the parameter interval, then x is in the normal hyperplane to f at f(λ_0):

⟨x − f(λ_0), f′(λ_0)⟩ = 0.

Proof: The squared distance ||x − f(λ)||² has a local minimum at λ_0; its derivative is well defined (λ_0 is an interior point) and must vanish there. □

A point x is called an ambiguity point for a curve f if it has more than one closest point on the curve: card{λ | ||x − f(λ)|| = d(x, f)} > 1. Let A denote the set of ambiguity points. Our next goal will be to show that A is measurable and has measure 0. Define M_λ, the orthogonal hyperplane to f at λ, by

M_λ = {x | ⟨x − f(λ), f′(λ)⟩ = 0}.
It will be useful to define a mapping that maps Λ × ℝ^{p−1} into ∪_λ M_λ. Choose p − 1 smooth vector fields n_1(λ), ..., n_{p−1}(λ) such that for every λ the vectors f′(λ), n_1(λ), ..., n_{p−1}(λ) are orthogonal. It is well known that such vector fields do exist. Define χ: Λ × ℝ^{p−1} → ℝ^p by

χ(λ, v) = f(λ) + Σ_{i=1}^{p−1} v_i n_i(λ),

and set M = χ(Λ, ℝ^{p−1}), the set of all points in ℝ^p lying in some hyperplane M_λ for some point on the curve. The mapping χ is smooth, because f and n_1, ..., n_{p−1} are assumed to be smooth.
We shall now present a few observations which simplify the proof that A is measurable and has measure 0.

Lemma 6.2: P(A ∩ M^c) = 0.

Proof: Suppose x ∈ M^c is an ambiguity point. According to lemma 6.1, this is possible only if its closest points on the curve correspond to endpoints of the parameter interval. The set of points equidistant from the images of the two endpoints forms a hyperplane, which has measure 0. □
Lemma 6.3: Let E be a measure 0 set. To show that P(A) = 0, it is sufficient to show that for every z ∈ ℝ^p \ E there exists a neighborhood N(z) with P(A ∩ N(z)) = 0.

Proof: The open covering {N(z) | z ∈ ℝ^p \ E} of ℝ^p \ E contains a countable subcovering {N_i}, because the topology of ℝ^p has a countable base. □
Lemma 6.4: We can restrict ourselves to the case of compact Λ.

Proof: Set Λ_n = Λ ∩ [−n, n], f_n = f | Λ_n, and let A_n be the set of ambiguity points of f_n. Suppose z is an ambiguity point of f; then {λ | ||z − f(λ)|| = d(z, f)} is compact (lemma 5.1). Therefore z ∈ A_n for some n, and A ⊆ ∪_n A_n. □
We are now ready to prove

Proposition 6: The set of ambiguity points has measure 0.

Proof: We can restrict ourselves to the case of compact Λ (lemma 6.4). As P(A ∩ M^c) = 0 (lemma 6.2), it is sufficient to show that for every z ∈ M, with the possible exception of a set C of measure 0, there exists a neighborhood N(z) with P(A ∩ N(z)) = 0. We choose C to be the set of critical values of χ. (A point y ∈ M is called a regular value if rank(χ′(z)) = p for all z ∈ χ^{−1}(y); otherwise y is called a critical value.) By Sard's theorem (Milnor, 1965), C has measure 0. Pick z ∈ M ∩ C^c. We will first show that χ^{−1}(z) is a finite set {(λ_1, v_1), ..., (λ_k, v_k)}. Suppose on the contrary that there were an infinite set {(ξ_1, w_1), (ξ_2, w_2), ...} with χ(ξ_i, w_i) = z. By compactness of Λ and continuity of χ, there would exist a cluster point ξ_0 of {ξ_1, ξ_2, ...} and a corresponding w_0 with χ(ξ_0, w_0) = z.
On the other hand, z was assumed to be a regular value of χ, and thus χ would be a diffeomorphism between a neighborhood of (ξ_0, w_0) and a neighborhood of z. This is a contradiction. Because z is a regular value, there are neighborhoods L_i of (λ_i, v_i) and a neighborhood N(z) such that χ is a diffeomorphism between L_i and N. Actually, a stronger statement holds. We can find Ñ(z) ⊆ N(z) for which χ^{−1}(Ñ) ⊆ ∪_{i=1}^{k} L_i. Suppose this were not the case. Then there would exist a sequence z_1, z_2, ... → z and corresponding (ξ_i, w_i) ∉ ∪_{i=1}^{k} L_i with χ(ξ_i, w_i) = z_i. The set {ξ_1, ξ_2, ...} has a cluster point ξ_0 ∉ ∪ L_i with a corresponding w_0, and by continuity χ(ξ_0, w_0) = z, which is a contradiction. For every y ∈ Ñ(z) there exists exactly one pair (λ_i(y), v_i(y)) ∈ L_i with χ(λ_i(y), v_i(y)) = y, and λ_i(y) is a smooth function of y. Define λ_0(y) = λ_min and λ_{k+1}(y) = λ_max. Set d_i(y) = ||y − f(λ_i(y))||². A simple calculation using the chain rule and the fact that ⟨y − f(λ_i(y)), f′(λ_i(y))⟩ = 0 (lemma 6.1) shows that
grad(d_i(y)) = 2(y − f(λ_i(y))). A point y ∈ Ñ(z) can be an ambiguity point only if y ∈ A_ij for some i ≠ j, where

A_ij = {x | d_i(x) = d_j(x)}.

However, for i ≠ j, grad(d_i − d_j) ≠ 0, because the curve f(λ) was assumed not to intersect itself. Therefore A_ij is a smooth, possibly not connected, manifold of dimension p − 1, which has measure 0, and the proposition follows. □
Note: We glossed over a technicality: Sard's theorem requires χ to be defined on an open set. However, we can always extend f in a smooth manner beyond the boundaries of the interval.
In the following, let G_B denote the class of smooth curves g parametrized over Λ, with ||g(λ)|| ≤ 1 and ||g′(λ)|| ≤ 1. For g ∈ G_B, define f_t(λ) = f(λ) + t g(λ). It is easy to see that f_t has finite length inside any finite ball, and for t < 1, λ_{f_t} is well defined. Moreover, we have
Lemma 4.1: If z is not an ambiguity point for f, then λ_{f_t}(z) → λ_f(z) as t → 0.

Proof: We have to show that for every ε > 0 there exists δ > 0 such that for all t < δ, |λ_{f_t}(z) − λ_f(z)| < ε. Set C = Λ ∩ (λ_f(z) − ε, λ_f(z) + ε)^c and d_ε = inf_{λ∈C} ||z − f(λ)||. The infimum is attained, and d_ε > ||z − f(λ_f(z))||, because z is not an ambiguity point. Choose δ = (d_ε − ||z − f(λ_f(z))||)/3. Then for t < δ,

inf_{λ∈C} (||z − f_t(λ)|| − ||z − f_t(λ_f(z))||) ≥ d_ε − δ − ||z − f(λ_f(z))|| − δ = δ > 0,

and thus λ_{f_t}(z) ∈ (λ_f(z) − ε, λ_f(z) + ε). □
We are now ready to prove:

Proposition 4: The curve f is a principal curve of h iff

(d/dt) D²(h, f_t) |_{t=0} = 0 for all g ∈ G_B.
Proof: We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression

(d/dt) D²(h, f_t) = (d/dt) E ||X − f_t(λ_{f_t}(X))||².

We need to find a pair of integrable random variables which almost surely bound

Z_t = (||X − f_t(λ_{f_t}(X))||² − ||X − f(λ_f(X))||²) / t

for all sufficiently small t > 0. Now λ_{f_t}(X) minimizes the distance of X to f_t, and therefore

Z_t ≤ (||X − f_t(λ_f(X))||² − ||X − f(λ_f(X))||²) / t.

Expanding the first norm, we get

||X − f_t(λ_f(X))||² = ||X − f(λ_f(X))||² − 2t ⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t² ||g(λ_f(X))||²,
and thus

Z_t ≤ −2 ⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t ||g(λ_f(X))||².    (13)

Using the Cauchy-Schwarz inequality and the assumption that ||g|| ≤ 1, we obtain

Z_t ≤ 2 ||X − f(λ_f(X))|| + 1 ≤ 2 ||X − f(λ_0)|| + 1

for all t < 1 and arbitrary λ_0, because f(λ_f(X)) is a closest point of the curve to X. As ||X|| was assumed to be integrable, so is this bound. Similarly, starting from

Z_t ≥ (||X − f_t(λ_{f_t}(X))||² − ||X − f(λ_{f_t}(X))||²) / t

and expanding the first norm, we get
Z_t ≥ −2 ⟨X − f(λ_{f_t}(X)), g(λ_{f_t}(X))⟩ + t ||g(λ_{f_t}(X))||²    (14)
    ≥ −2 ||X − f(λ_{f_t}(X))||
    ≥ −2 ||X − f(λ_0)|| − 4,
which is once again integrable. By the dominated convergence theorem, the interchange is justified. Moreover, from (13) and (14), and because f and g are continuous functions, we see that the limit lim_{t→0} Z_t exists whenever λ_{f_t}(X) is continuous in t at t = 0. We have proved this continuity for a.e. x in lemma 4.1. By (13) and (14), this limit is given by

lim_{t→0} Z_t = −2 ⟨X − f(λ_f(X)), g(λ_f(X))⟩.

Denoting the distribution of λ_f(X) by h_λ, we get

(d/dt) D²(h, f_t) |_{t=0} = −2 ∫ ⟨E(X | λ_f(X) = λ) − f(λ), g(λ)⟩ h_λ(dλ).
If f(λ) is a principal curve of h, then by definition E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and thus

(d/dt) D²(h, f_t) |_{t=0} = 0.

Conversely, suppose that (d/dt) D²(h, f_t) |_{t=0} = 0 for all g ∈ G_B. Considering each coordinate separately and re-expressing the integral above coordinate-wise, this implies that E(X | λ_f(X) = λ) = f(λ) for a.e. λ. □
Construction of densities with known principal curves. Let f be parametrized over a compact interval Λ. It is easy to construct densities with carrier in a tube around f, for which f is a principal curve. Denote by B_r the ball in ℝ^{p−1} with radius r and center at the origin. The construction is based on the following:

Proposition 7: If Λ is compact, there exists r > 0 such that χ | Λ × B_r is a diffeomorphism.

Proof: Suppose the result were not true. Pick a sequence r_i → 0. There would exist sequences (λ_i, v_i) ≠ (ξ_i, w_i) with ||v_i|| ≤ r_i, ||w_i|| ≤ r_i and χ(λ_i, v_i) = χ(ξ_i, w_i). The sequences λ_i and ξ_i have cluster points λ_0 and ξ_0. We must have λ_0 = ξ_0, because

f(ξ_0) = χ(ξ_0, 0) = lim_{i→∞} χ(ξ_i, w_i) = lim_{i→∞} χ(λ_i, v_i) = χ(λ_0, 0) = f(λ_0),

and by assumption f does not intersect itself. So there would be sequences (λ_i, v_i) and (ξ_i, w_i), both converging to (λ_0, 0), with χ(λ_i, v_i) = χ(ξ_i, w_i). However, it is easy to see that (λ_0, 0) is a regular point of χ, and thus χ maps a neighborhood of (λ_0, 0) diffeomorphically into a neighborhood of f(λ_0), which is a contradiction. □

Define T(f, r) = χ(Λ × B_r). Proposition 7 assures that there are no ambiguity points in T(f, r), and that λ_f(x) = λ for x ∈ χ(λ, B_r). Pick a density φ(λ) on Λ and a density ψ(v) on B_r with ∫_{B_r} v ψ(v) dv = 0. The mapping χ carries the product density φ(λ) ψ(v) on Λ × B_r into a density h(x) on T(f, r). It is easy to verify that f is a principal curve for h.

References
389, Institute for Mathematical Studies in the Social Sciences, Stanford University, California.
Becker, R., Chambers, J. and Wilks, A. (1988), The New S Language, Wadsworth, New York.
Boynton, R.M. and Gordon, J. (1965), "Bezold-Brücke Hue Shift Measured by Color-Naming Technique", Journal of the Optical Society of America, 55.
Carroll, D.J. (1969), "Polynomial Factor Analysis", Proceedings of the 77th Annual Convention, APA, 103-104.
Cleveland, W.S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots", Journal of the American Statistical Association, 74, 829-836.
Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, SIAM CBMS-NSF Regional Conference Series, 38.
Etezadi-Amoli, J. and McDonald, R.P. (1983), "A Second Generation Nonlinear Factor Analysis", Psychometrika, 48, #3, 315-342.
Golub, G.H. and van Loan, C. (1979), "Total Least Squares", in Smoothing Techniques for Curve Estimation, Proceedings, Heidelberg, Springer-Verlag.
Hart, J. and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data", Journal of the American Statistical Association, 81, 1080-1088.
Hastie, T.J. (1984), "Principal Curves and Surfaces", Laboratory for Computational Statistics tech. rep. #11 and Ph.D. thesis, Statistics Department, Stanford University.
Milnor, J.W. (1965), Topology from the Differentiable Viewpoint, University Press of Virginia, Charlottesville.
Shepard, R.N. and Carroll, D.J. (1966), "Parametric Representations of Nonlinear Data Structures", in Multivariate Analysis (Krishnaiah, P.R., ed.), Academic Press, New York.
Silverman, B.W. (1985), "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting", Journal of the Royal Statistical Society B, 47, 1-52.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions (with Discussion)", Journal of the Royal Statistical Society B, 36, 111-147.
Thorpe, J.A. (1979), Elementary Topics in Differential Geometry, Springer-Verlag, New York (Undergraduate Texts in Mathematics).
Wahba, G. and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-validation", Communications in Statistics, 4, 1-17.
Watson, G.S. (1964), "Smooth Regression Analysis", Sankhyā Series A, 26, 359-372.
FIGURE 1. (a) The linear regression line minimizes the sum of squared deviations in the response variable. (b) The principal component line minimizes the sum of squared deviations in all the variables. (c) The smooth regression curve minimizes the sum of squared deviations in the response variable, subject to smoothness constraints. (d) The principal curve minimizes the sum of squared deviations in all the variables, subject to smoothness constraints.
FIGURE 2. The radius of curvature is the radius of the circle tangent to the curve with the same acceleration as the curve.
FIGURE 3. Each point on a principal curve is the average of the points that project there.
FIGURE 4. The mean of the observations projecting onto an endpoint of the curve can be disjoint from the rest of the curve.
FIGURE 5. Selected iterates of the principal curve procedure for the circle data. In all the figures we see the data, the circle from which the data are generated, and the current estimate produced by the algorithm. (a) The starting curve is the principal component line, with average squared distance D²(f^(0)) = 12.91. (b) Iteration 2: D²(f^(2)) = 10.43. (c) Iteration 4: D²(f^(4)) = 2.58. (d) Final iteration 8: D²(f^(8)) = 1.55.
FIGURE 6. These schematics emphasize the iterative nature of the algorithm. The curve of the first iteration is a function of λ^(0) measured along the starting curve (a); the curve of the second iteration is a function of λ^(1) measured along the curve of the first iteration (b).
FIGURE 7. (a) The final curve in figure 6 with one more smoothing step, using cross-validation separately for each of the coordinates: D²(f^(9)) = 1.28. (b) The curve obtained by continuing the iterations using cross-validation at every step.
FIGURE 8. The points represent three-dimensional data generated from a helix with gaussian errors. The dashed curve is the generating helix, and the solid curve is the curve fitted to the data. All are projected onto a plane chosen to give a reasonable view of the results.
FIGURE 9. Some curves produced by the principal curve algorithm for data generated by selecting a location uniformly on a circle and adding bivariate spherical gaussian errors. (a) The curve found when the algorithm is started at the circle centered at the mean. (b) The curve found when a periodic smoother is used. (c) The curve found when the algorithm is started at the linear principal component; apart from the endpoints, it is close to the generating curve.
FIGURE 11. (a) A rough schematic of the Stanford linear accelerator and the linear collider ring. (b) The fitted coordinate functions for the magnet positions for a section of the SLC. The data represent residuals from the theoretical curve. Some (35%) of the deviations from the fitted curve were small enough that these magnets were not moved.
FIGURE 12. (a) Plot of the log assays for the in-house and outside labs. The solid curve is the principal curve, the dashed curve the scatterplot smooth. (b) A band of 25 bootstrap curves. Each curve is the principal curve of a bootstrap sample, obtained by randomly assigning residual vectors to the fitted values of the principal curve (solid curve). The spread of the curves gives an indication of the variance of the fitted curve. (c) Another band of 25 bootstrap curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors-in-variables regression line (solid line). This simulation is aimed at testing the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.
FIGURE 13. (a) The 4-dimensional color data and the principal curve projected onto the first principal component plane. (b) The estimated coordinate functions plotted against the arc length of the principal curve. The arc length is a monotone function of wavelength.
FIGURE 14. Each point on a principal surface is the average of the points that project there.