
Unifying the Named Natural Exponential Families and their Relatives

Carl N. Morris and Kari F. Lock Harvard University August 4, 2008

Abstract Five of the six univariate natural exponential families (NEF) with quadratic variance functions (QVF), meaning their variances are at most quadratic functions of their means, are the Normal, Poisson, Gamma, Binomial, and Negative Binomial distributions. The sixth is the NEF-CHS, the NEF generated by convolved Hyperbolic Secant distributions. These six NEF-QVFs and their relatives are unified in this paper and in the main diagram, Figure 1, which connects NEFs with many other named distributions, including Pearson's families of conjugate distributions (Inverted Gamma, Beta, F, and Skewed-t), conjugate mixtures (including two Polya urn schemes, with types I and II sampling), and conditional distributions (including the Hypergeometric and Negative Hypergeometric). Limit laws that relate these distributions are indicated by solid arrows in Figure 1.

Keywords: Normal, Poisson, Gamma, Binomial, Pearson families, quadratic variance functions

1 INTRODUCTION

All statisticians appreciate that the Normal, Poisson, Gamma, Binomial, and Negative Binomial distributions reach powerfully into every realm of theoretical and applied statistics. These distributions are five of the six natural exponential families (NEFs) that have quadratic variance functions (QVF), i.e. the variance is at most a quadratic function, V(µ), of the mean µ. Figure 1 shows the six NEF-QVFs in red ellipses, with arrows that connect them to related distributions. Section 2 introduces NEFs as special exponential

Carl N. Morris ([email protected]) is Professor of Statistics, and Kari F. Lock ([email protected]) is a doctoral student, both of the Department of Statistics, Harvard University, Cambridge, MA 02138. We are grateful for support from Harvard's Clark Fund. We thank Cindy Christiansen for her contributions to a 1988 version of Figure 1, and Cindy and Joe Blitzstein for their valuable discussions.

Figure 1: The six NEF-QVF distributions (in ellipses), each with its conjugate (rectangles), conjugate mixture (hexagons), and conditional (octagons). Limit laws are portrayed with solid arrows. Columns are ordered by complexity (number of parameters), from 1 parameter to 5 parameters.

[Figure 1 diagram not reproduced; the recoverable legend text follows.]

Variance function: V(µ) = v2 µ² + v1 µ + v0.

General forms, in [mean, variance] notation. NEF-QVF: Y | µ ∼ [µ, V(µ)/r]. Conjugate: µ ∼ [µ0, V(µ0)/(r0 - v2)]. Mixture: Y ∼ [µ0, (r + r0)V(µ0)/(r(r0 - v2))]. Conditional: with Yi ∼ NEF[µ, V(µ)/ri] independent and µ̂ ≡ (r1Y1 + r2Y2)/(r1 + r2), Y1 | µ̂ ∼ [µ̂, r2V(µ̂)/(r1(r1 + r2 + v2))].

Definitions. Conjugate: the density proportional to the NEF-QVF likelihood when viewed as a function of the mean. Mixture: Y | µ ∼ NEF-QVF, µ ∼ Conjugate ⇒ Y ∼ Mixture. Conditional: Y1, Y2 ∼ NEF-QVF, Y1, Y2 ind. ⇒ Y1 | (Y1 + Y2) ∼ Conditional.

Nodes: NEF 1 Normal, NEF 2 Poisson, NEF 3 Gamma, NEF 4 Binomial, NEF 5 Neg Bin, NEF 6 NEF-CHS; Inverted Gamma, Beta, F, Skew-t; Polya I (= Neg HG), Polya II, Skew-t CHS; Hypergeometric. Limit arrows are labeled CLT and LLN (constant limit).

Sample space details: discrete (d), continuous (c), bounded (b), semi-bounded (s), unbounded (u). Family properties: infinitely divisible (i).

Special cases: Exponential, Chi-Square, Erlang, Symmetrized Laplace (Gamma); Bernoulli (Binomial); Geometric (Negative Binomial); Hyperbolic Secant (NEF-CHS); Uniform, Arcsin (Beta); t, Cauchy (Skew-t). EFs (nonlinear transformations of NEFs): Lognormal (Normal); Generalized Gamma, Weibull, Extreme Value, Pareto, Chi, Power Function, Inverted Gamma (Gamma).

Parameters: Location (L), Scale (S), Exponential Family Generation (G), Convolution (C), and Population Size (N).

Table 1: Key facts for the six NEF-QVF distributions. The mean, µ, variance function, V(µ), natural parameter, η, and cumulant function, ψ(η), are given for the elementary distributions (r = 1), Y1. Convolutions (bottom half of the table) are sums of r iid elementary distributions.

families (EFs), and then covers the operations of NEF generation, convolution and division, and linear transformations. Section 3 introduces the variance function (VF), focusing on NEF-QVF distributions. Besides these famous five distributions, the sixth and only other NEF-QVF is the NEF generated by convolutions (including infinite divisors) of the Hyperbolic Secant (HS) distribution (3.5), labeled the "NEF-CHS", with "C" for convolution. Four types of arrows in Fig. 1 summarize relationships among the six NEF-QVFs and various named univariate distributions. Pearson's families of distributions, in the blue rectangles in Fig. 1 placed directly below each NEF-QVF, arise as conjugate (prior) distributions of the six NEF-QVFs (Section 4). Conjugate mixtures (marginal distributions of NEFs) stemming from these Pearson conjugates are in green hexagons to the lower right of each NEF-QVF (Section 5). The purple octagons to the upper right of each NEF-QVF reveal the conditional distributions of one NEF, given the sum of two such independent NEFs (Section 6). Limit laws follow leftward-pointing solid arrows (←), with simplified variance functions (Section 7).

Fig. 1 and this paper provide an overview of key ideas taken from Morris (1982, 1983), while proofs and other results are left to those papers and other references. Another diagram that relates distributions, by Leemis and McQueston (2008), connects nearly all named univariate distributions. While Figure 1 here includes many of their distributions, i.e. all those that arise as relatives of the six NEF-QVFs, our purpose is to reveal the common structure connecting these distributions. We hope readers will appreciate this powerful glimpse into the beautiful unification of the distributions we all encounter regularly and love to work with.

2 NATURAL EXPONENTIAL FAMILIES

2.1 Defining an NEF

Natural exponential families are a subclass of all exponential families. The distributions of a random variable X form a univariate exponential family (EF) if their densities or probability mass functions (PMF) have the form

$$\exp\{A(x)B(\theta) + C(x) + D(\theta)\}, \tag{2.1}$$

with θ as the parameter of interest. Special subclasses of these are the natural exponential families (NEF), with Y = A(X) termed the natural observation and Y ∼ NEF. Y follows an NEF because y is multiplied in the exponent by (a function of) the parameter of interest. The distributions of X in (2.1) follow an EF, but not an NEF unless A(·) is linear. Some other authors have used the terminologies linear (Patil 1985), and canonical or standard (Brown 1986), as synonyms for natural. To segue into NEF language, define η as the natural parameter, and ψ(·) as the cumulant function, where η ≡ B(θ) and ψ(η) ≡ -D(θ). Defining dG0(x) ≡ d(e^{C(x)}), (2.1) becomes

$$P(X \in B) = \int_B \exp\{\eta A(x) - \psi(\eta)\}\, dG_0(x). \tag{2.2}$$

For Y = A(X), this simplifies to the general form characterizing NEFs:

$$P(Y \in B) = \int_B \exp\{\eta y - \psi(\eta)\}\, dF_0(y) = \int_B dF_\eta(y). \tag{2.3}$$

A univariate NEF is a parametric family of distributions with a random variable Y taking values in a sample space S, with Y satisfying (2.3). The natural parameter, η, lies in the natural parameter space, H (the Greek capital eta), a nondegenerate interval containing 0. Taking B = (-∞, y], (2.3) yields the family of CDFs, F_η(y), indexed by η. Often η is a transformation of some more familiar parameter, such as the mean parameter µ. If 0 ∈ H (as is assumed here), then F0 is a cumulative distribution function (CDF) with moment generating function (MGF), M0(t) = exp(ψ(t)), and H is the interval on which M0(t) is finite, but now with t replaced by η. In NEF terms, ψ(η) is the cumulant function for the NEF (not the cumulant generating function of just one distribution) because for NEFs the k-th cumulant of Y, Ck for k = 1, 2, ..., is

$$C_k = \psi^{(k)}(\eta) = \frac{d^k \psi(\eta)}{d\eta^k}. \tag{2.4}$$

The first two cumulants, the means and the variances, are

$$\mu \equiv E_\eta Y = \psi'(\eta) \equiv C_1(\mu) \equiv C_1, \tag{2.5}$$

and

$$V(\mu) \equiv \mathrm{Var}_\eta(Y) = \psi''(\eta) \equiv C_2(\mu) \equiv C_2. \tag{2.6}$$

The means, µ, lie in the mean space ψ'(H). The variance function V(µ) is central to NEF-QVF theory; see Section 3. NEFs have several advantages over EFs. Derivatives of ψ(η) yield cumulants of Y = A(X), the natural observation, but not of X. Convolutions of NEFs follow their own NEFs, with cumulant functions that are multiples of ψ(η), but convolutions of members of an EF that is not an NEF generally are complicated. Sufficient statistics with independent EFs are convolutions of the natural observations, Y = A(X), and not of the X.

To illustrate, consider Y ∼ N(µ, σ²) with σ² known and fixed. In (2.3), η = µ/σ², ψ(η) = σ²η²/2 and $dF_0(y) = (2\pi\sigma^2)^{-1/2} e^{-y^2/(2\sigma^2)}\,dy$, so the distributions of Y form an NEF. Differentiating the cumulant function ψ(η) yields the mean, ψ'(η) = σ²η = µ, and the variance function, with ψ''(η) = σ² = constant. Alternatively, when X is LogNormal, X = exp(Y), then X follows (2.2) with A(x) = log x, so X is an EF but not an NEF. While the cumulant function, ψ, can be differentiated to yield Normal cumulants, LogNormals do not have cumulant functions, and the density function of a convolution of LogNormals is intractable.
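The Normal illustration above can be checked numerically. The following is a minimal sketch, assuming the cumulant function ψ(η) = σ²η²/2 from the text; the helper names and the numerical values of σ² and µ are illustrative choices, not taken from the paper. It approximates ψ'(η) and ψ''(η) by central finite differences and compares them with µ and σ².

```python
def psi_normal(eta, sigma2):
    """Cumulant function of the N(mu, sigma2) NEF with sigma2 known."""
    return sigma2 * eta ** 2 / 2.0


def first_two_cumulants(psi, eta, h=1e-4):
    """Central finite differences approximating psi'(eta) and psi''(eta)."""
    mean = (psi(eta + h) - psi(eta - h)) / (2.0 * h)
    var = (psi(eta + h) - 2.0 * psi(eta) + psi(eta - h)) / h ** 2
    return mean, var


sigma2, mu = 4.0, 1.5      # illustrative values (assumption)
eta = mu / sigma2          # natural parameter eta = mu / sigma^2
mean, var = first_two_cumulants(lambda e: psi_normal(e, sigma2), eta)
print(mean, var)           # close to mu and sigma2, up to rounding
```

Because ψ is quadratic here, the finite differences are exact apart from floating-point rounding, which makes this family a convenient sanity check before trying other cumulant functions.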

2.2 Generating an NEF

Starting with a solitary member distribution of an NEF, all possible distributions within that NEF can be generated via five operations: using linear functions (translations and re-scalings), convolution and division (division being the inverse of convolution), and exponential generation (defined next). Each of the six NEF-QVFs starts with a single generator distribution, taken in Table 1 to be Normal(0,1), Poisson(1), Exponential(1), Bernoulli(1/2), Geometric(1/2), and the Hyperbolic Secant (symmetric with mean 0, variance 1). The first five are widely accepted as the simplest members of each family.

Any distribution with an MGF generates an NEF via exponential family generation, EFG, accomplished as follows. Suppose Y0 has CDF F0 and MGF M0(t) for t ∈ H. Now replace t with η and denote the cumulant function as log(M0(t)), letting ψ(η) ≡ log[M0(η)] = log ∫ exp{ηy} dF0(y), introducing a new (natural) parameter, η, the generation parameter. Multiplying dF0(y) by exp(ηy - ψ(η)) creates (generates) a (parametric) family of CDFs, F_η as in (2.3), η ∈ H. We refer to such an exponentially generated family (starting from any generator distribution) as the corresponding elementary NEF. Table 1 lists the six NEF-QVF elementary families, N(µ, 1), Pois(µ), µ Expo(1), Bern(p), Geom(p), NEF-HS(µ).

For example, let us start with the generator Poisson distribution, Pois(1), with PMF e^{-1}/y! on the nonnegative integers, and with MGF M0(t) = exp(e^t - 1). Then ψ(η) = log M0(η) = e^η - 1. Differentiation provides the mean µ = ψ'(η) = e^η, and so η = log(µ) is the natural parameter. Taken with respect to counting measure in (2.3), exponential family generation has expanded the Pois(1) distribution to the entire Pois(µ) family of PMFs.

Convolutions (and divisions, whenever possible) of elementary distributions yield another parameter, r > 0, the convolution/division parameter, or more simply, the convolution parameter. If r is an integer, r is the number of convolved elementary distributions.
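The Poisson generation example can be verified directly. This is a small sketch, assuming the Pois(1) generator and ψ(η) = e^η - 1 as stated above; the function names and the target mean are hypothetical illustrations. It multiplies the Pois(1) PMF by exp(ηy - ψ(η)), as in (2.3), and compares the result with the Pois(µ) PMF for µ = e^η.

```python
import math


def pois_pmf(y, mu):
    """Poisson(mu) probability mass at the nonnegative integer y."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def tilt_pois1(y, eta):
    """Exponential generation applied to the Pois(1) generator, as in (2.3)."""
    psi = math.exp(eta) - 1.0              # psi(eta) = log M0(eta) for Pois(1)
    return math.exp(eta * y - psi) * pois_pmf(y, 1.0)


mu = 3.2                                    # illustrative target mean (assumption)
eta = math.log(mu)                          # natural parameter eta = log(mu)
max_gap = max(abs(tilt_pois1(y, eta) - pois_pmf(y, mu)) for y in range(40))
print(max_gap)                              # tiny: the two PMFs agree up to rounding
```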
For example, NBin(r, p) is the convolution (sum) of r i.i.d. Geom(p) distributions, and Bin(r, p) is the convolution of r Bern(p) distributions. We then extend r to include division and infinite division whenever possible. This means that for infinitely divisible distributions, i.e. for all NEF-QVFs other than Binomials, r can be any positive real number. Convolutions of members within an NEF remain within that NEF, provided the convolved distributions have the same generation parameter η. Exponential family generation and convolution/division commute. These operations, applied in either order, produce the same NEF, as Fig. 2 illustrates.

Figure 2: The operations of exponential family generation and convolution/division commute.

Other NEFs are produced via linear transformations. If Y ∼ F_{η,r}, other NEFs of the same type include all linear transformations a0 ± a1 Y, giving rise to location parameters a0 ∈ R, and scale parameters a1 > 0. For example, this allows for Poisson or Binomial distributions with support other than on integer lattices. Since linear operations preserve the quadratic nature of variance functions, we consider the linearly transformed distributions to be part of each originating family. Exponential family generation leaves the support unchanged, so that linear transformations of the generator distributions usually produce different NEFs. Exceptions are the Normal and Gamma distributions, for which the exponential generation parameters coincide with the location and scale parameters, respectively. The six NEF-QVF generator distributions all generate different NEFs, and the term NEF type can be used to refer not only to the family generated from the generator distribution, but also to all possible convolutions and linear transformations. Each red ellipse in Figure 1 represents one NEF-QVF type.
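The commuting property can be illustrated on a finite support. The sketch below assumes the Bernoulli(1/2) generator named in Section 2.2; the helper functions and the values of η and r are illustrative assumptions. It applies generation then convolution, and convolution then generation, and checks that both routes give the same PMF.

```python
import math


def convolve(p, q):
    """PMF of the sum of two independent variables on 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def convolve_power(pmf, r):
    """PMF of the sum of r iid copies (r a positive integer)."""
    out = [1.0]
    for _ in range(r):
        out = convolve(out, pmf)
    return out


def tilt(pmf, eta):
    """Exponential generation: reweight by exp(eta*y), then renormalize.

    Dividing by the total is the same as subtracting the cumulant function.
    """
    w = [f * math.exp(eta * y) for y, f in enumerate(pmf)]
    total = sum(w)
    return [x / total for x in w]


generator = [0.5, 0.5]                        # Bern(1/2)
eta, r = 0.7, 5                               # illustrative values (assumption)
a = convolve_power(tilt(generator, eta), r)   # generate, then convolve
b = tilt(convolve_power(generator, r), eta)   # convolve, then generate
gap = max(abs(x - y) for x, y in zip(a, b))
print(gap)                                    # tiny: both routes give Bin(r, p)
```

Both routes land on Bin(r, p) with p = e^η/(1 + e^η), which is the Binomial instance of the statement that the two operations produce the same NEF.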

2.3 Remarks on NEFs

Exponential family generation, convolutions/divisions, and location and scale transformations give rise to four parameters: generation (η or 1-1 functions of η such as µ), convolution/division (r > 0), location (a0) and scale (a1). Some parameters may play dual roles, as displayed in Table 1. For example, the mean µ in Poisson(µ) serves as both the generation and convolution parameter, leaving the Poisson with three unique parameters, not four. For the Normal, the standard deviation is the scale parameter and its square, the variance, is the inverted convolution parameter (being 1-1 functions of each other, these serve as just one parameter). Each column in Figure 1 is headed by the number of unique parameters: two for the Normal, three for Poissons and Gammas, the next simplest, and four parameters for the remaining three NEF-QVFs.

Table 1 provides the natural parameter η for each of the six NEF-QVFs (perhaps as a function of the mean or of some more familiar parameter). The cumulant function, ψ(η), is the logged MGF of the generator distribution (replacing t with η). The MGF of the NEF for each η is M_η(t) = exp{r[ψ(η + t) - ψ(η)]}, and log(M_η(t)) is the cumulant generating function (a function of both t and η, and not to be confused with the cumulant function). The k-th derivative of the cumulant generating function, log(M_η(t)), evaluated at t = 0 equals the k-th derivative of the cumulant function, ψ(η), justifying (2.4).

If 0 ∈ H, the natural parameter space, then F0 may be normalized to be a CDF for η = 0. If 0 ∉ H, one may either shift the natural parameter space to contain 0, making F0 a CDF, or instead simply regard F0 as an increasing monotone function (a Stieltjes measure) not depending on η.

3 NEF-QVFs

3.1 Variance Functions

The variance function (VF), V(µ), expresses the variance of a distribution in terms of its mean, µ. By (2.5) and (2.6),

$$\frac{d\mu}{d\eta} = \frac{d\psi'(\eta)}{d\eta} = \psi''(\eta) = \mathrm{Var}(Y) > 0, \tag{3.1}$$

so µ is 1-1 in η, increasing monotonically. Thus Var(Y) = ψ''(η) is a function of µ. Cumulants are computable recursively as functions of µ,

$$C_{k+1} = \frac{dC_k}{d\eta} = \frac{dC_k}{d\mu}\cdot\frac{d\mu}{d\eta} = C_k'(\mu)\,V(\mu). \tag{3.2}$$

Denoting V' as dV(µ)/dµ, cumulants are expressible in terms of V as C3 = V'·V, C4 = V''·V² + V'²·V, with higher cumulants derived via (3.2). Cumulants are convertible to central moments, Mk for k ≥ 2, by M2 = C2, M3 = C3, M4 = C4 + 3C2², and for k ≥ 4,

$$M_k = C_k + \sum_{i=2}^{k-2} \binom{k-1}{i} M_i\, C_{k-i}$$

(Morris 1982, Sec. 7). Therefore, the variance function and its domain (the mean space, which is the convex closure of the sample space) completely determine the NEF via its MGF (although no specific parameter µ is identified). Note that the VF only characterizes an NEF (not an EF). For example, LogNormals and Gammas have the same VF, but LogNormals are not NEFs. We use the notation Y ∼ NEF[µ, V(µ)], with square brackets denoting [mean, variance]. Together with the mean space of µ, this completely specifies the NEF.

Quadratic variance functions control standard deviations in relation to the mean. For NEFs, a QVF ensures that the k-th cumulants and moments are polynomials of degree at most k in µ.
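The recursion (3.2) is easy to mechanize when V(µ) is a polynomial. The following is a rough sketch, assuming a quadratic V(µ) = v2µ² + v1µ + v0 and representing each cumulant as a list of polynomial coefficients in µ (lowest degree first); all function names are hypothetical.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def poly_deriv(a):
    """Derivative of a polynomial coefficient list."""
    return [i * c for i, c in enumerate(a)][1:] or [0.0]


def poly_eval(a, mu):
    """Evaluate a polynomial coefficient list at mu."""
    return sum(c * mu ** i for i, c in enumerate(a))


def cumulant_polys(v0, v1, v2, kmax):
    """Return [C1, ..., Ckmax] as polynomials in mu, using C_{k+1} = C_k' * V."""
    V = [v0, v1, v2]
    cums = [[0.0, 1.0]]                      # C1 = mu
    for _ in range(kmax - 1):
        cums.append(poly_mul(poly_deriv(cums[-1]), V))
    return cums


poisson = cumulant_polys(0.0, 1.0, 0.0, 5)      # V(mu) = mu
bernoulli = cumulant_polys(0.0, 1.0, -1.0, 4)   # V(mu) = mu - mu^2
print([poly_eval(c, 2.5) for c in poisson])     # every Poisson cumulant equals mu
print(poly_eval(bernoulli[2], 0.3))             # third Bernoulli cumulant at p = 0.3
```

For the Bernoulli case the third cumulant reduces to p(1 - p)(1 - 2p), which agrees with C3 = V'·V, and each C_k is a polynomial of degree at most k in µ, as the text states.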

3.2 Isolating µ from r

The precise form of a variance function depends on the scaling of µ. Consider Yi iid ∼ Bern(p), and Y ≡ Σ_{i=1}^{r} Yi ∼ Bin(r, p). While one could define EY = rp ≡ µ so that Var(Y) = rp(1 - p) = µ - µ²/r ≡ V(µ), we prefer EY = rp ≡ rµ so the variance is rV(µ) with V(µ) = µ - µ². In general, we choose to let µ and V(µ) pertain to the elementary distribution's moments, i.e. when r = 1. Only one-parameter NEF-QVFs are considered here, with r fixed (and known) and only the generation parameter η unknown, so it is often most convenient to separate r from η to isolate the parameter of interest. In order to keep the expectation equal to µ for general r, we consider Ȳ = Y/r, with expectation not dependent on r (even though this may change the sample space S by a factor of 1/r). Then Ȳ satisfies

$$dF_{\eta,r}(\bar{y}) = \exp\{r[\eta \bar{y} - \psi(\eta)]\}\, dF_{0,r}(\bar{y}). \tag{3.3}$$

When r = 1, this simplifies to (2.3). This isn't a fundamental change, and if the earlier form is more convenient, one can return to (2.3) by reabsorbing the known constant r into η, with ψ(η) altered accordingly. Table 2 gives densities, means, VFs, and cumulants for these parameterizations, chosen so that V(µ) itself does not depend on r.

Table 2: Densities, means, variance functions, and higher cumulants for NEF convolutions and averages. Averages are extendable to non-integer r.

Distribution                              | Density                        | Mean | VF      | k-th cumulant
Elementary, Y1                            | exp{ηy1 - ψ(η)} dF0,1(y1)      | µ    | V(µ)    | ψ^(k)(η)
Convolution, Y ≡ Σ_{i=1}^{r} Yi, Yi iid   | exp{ηy - rψ(η)} dF0,r(y)       | rµ   | rV(µ)   | r ψ^(k)(η)
Average, Ȳ ≡ Y/r                          | exp{r[ηȳ - ψ(η)]} dF0,r(ȳ)     | µ    | V(µ)/r  | ψ^(k)(η)/r^(k-1)

3.3 The Six NEF-QVFs

Quadratic variance functions (QVFs) satisfy

$$V(\mu) = v_2\mu^2 + v_1\mu + v_0. \tag{3.4}$$

Precisely six NEF-QVF types exist (Morris 1982, Sec. 4), displayed as red ellipses in Figure 1. The VF for each elementary NEF-QVF is shown in Table 1, and is multiplied by r for convolutions and divided by r when we divide the convolution by r. Figure 1 indicates whether the support of each distribution is continuous or discrete, bounded, semi-bounded (bounded only above or below), or unbounded, and whether or not the distribution is infinitely divisible. Further details for NEF-QVFs are displayed in Table 1. The first five NEF-QVFs serve at the core of statistics, as sampling distributions, but the sixth is largely unknown. These six distributions are numbered 1-6, corresponding to their parameterization dimensions and to increasing sample space complexities.

NEF1. Normal distributions (or Gaussians) are widely used as sampling distributions because of their flexibility, their sampling properties, and the central limit theorem. Normals form the only NEFs with variances not depending on their means.

NEF2. Poisson NEFs count the occurrence of rare events. They serve as the cornerstone for the theory of infinite divisibility, which from the perspective of this paper is because the Poisson convolution and generation parameters coincide.

NEF3. Gamma NEFs arise as convolutions and divisions of Exponentials, the latter being famous for their memoryless property. Gammas are denoted by many authors as Gamma(α, β). We prefer to avoid the β or 1/β scale parameter confusion by using the simpler representational notation Y ∼ µ · Gam(r). Here µ · Gam(r) represents the convolution of r Exponentials, each with mean µ. Then, in our square brackets notation [mean, variance], Y ∼ Gam[rµ, rµ²]. Special cases of Gammas include Exponentials µExpo(1) ∼ µGam(1), Chi-Squares (χ²_n ∼ 2Gam(n/2)), Erlangs a1 Gam(k), k ∈ N, and Laplace distributions ±µExpo(1). Named exponential families that are nonlinear transformations of Gammas and Exponentials, with G ∼ Gamma and X ∼ Exponential, include Generalized Gammas G^γ, Weibulls X^γ, Gumbel's Extreme Value distributions log X, Paretos e^X + c, Chi distributions χ_n ∼ √(2Gam(n/2)), Rayleighs χ_2 ∼ √(2X) and χ_3, and Inverted Gammas 1/G. Cf. Johnson and Kotz (1970) or Evans, Hastings, and Peacock (1993).

NEF4. Binomials, Bin(r, p), count the number of successes in r i.i.d. Bernoulli trials.

NEF5. Negative Binomials, Y ∼ NBin(r, p), are convolutions of r i.i.d. Geometrics. These count the number of successes before r failures in successive Bernoulli(p) trials, so Y ∈ {0, 1, 2, ...}. More generally, r > 0 need not be an integer, because of infinite divisibility.

NEF6. The NEF-CHS arises from its generator distribution, the Hyperbolic Secant (HS). The HS is a continuous, infinitely divisible distribution, Y ∈ R, with a symmetric, bell-shaped density function, mean 0 and unit variance. It has a finite MGF and exponentially decaying tails, with a distributional representation of (2/π) log(|Cauchy|) and the hyperbolic secant as its PDF,

$$f(y) = \frac{1}{2\cosh(\pi y/2)}. \tag{3.5}$$

See Johnson and Kotz (1970); Morris (1982), and Manoukian and Nadeau (1988). Convolutions and divisions of this distribution yield all the Convolved Hyperbolic Secant (CHS) distributions. CHS distributions historically have been called the Generalized Hyperbolic Secant (GHS), Harkness and Harkness (1968). However, "convolved" is more descriptive than "generalized", as the latter takes many meanings (e.g. Generalized Gammas are powers of Gammas). Exponential family generation of the CHS produces the NEF-CHS (Morris 1982, Sec. 5). Statisticians lack convenient, usually skewed sampling distributions with support on all reals. The NEF-CHS provides one.
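Since the Hyperbolic Secant is the least familiar generator, a quick numerical check of (3.5) may help. This is a minimal sketch, assuming the density f(y) = 1/(2 cosh(πy/2)); the trapezoid-rule helper, its integration limits, and its grid size are arbitrary illustrative choices.

```python
import math


def hs_pdf(y):
    """Hyperbolic secant density from (3.5)."""
    return 1.0 / (2.0 * math.cosh(math.pi * y / 2.0))


def integrate(g, lo=-40.0, hi=40.0, n=80000):
    """Composite trapezoid rule; the mass beyond |y| = 40 is negligible here."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h


mass = integrate(hs_pdf)
mean = integrate(lambda y: y * hs_pdf(y))
var = integrate(lambda y: y * y * hs_pdf(y))
print(mass, mean, var)   # close to 1, 0, and 1 respectively
```

The result is consistent with the description of the HS as symmetric with mean 0 and unit variance.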

4 PEARSON CONJUGATES

4.1 Conjugate Families

Pearson's families (Johnson and Kotz, 1970; Kendall, Stuart, and Ord, 1987) arise exactly as the conjugate prior distributions for the mean µ of NEF-QVF distributions (Morris 1983, 1988). First, start with an NEF-QVF likelihood function expressed as a function of the natural parameter η, but with r0 > 0 and µ0 substituted for the convolution parameter r and the sufficient statistic ȳ respectively in (3.3). Taken with respect to Lebesgue measure on H, this defines a conjugate density given by

$$g_1(\eta)\,d\eta = K \exp\{r_0[\mu_0 \eta - \psi(\eta)]\}\,d\eta, \tag{4.1}$$

where K = K(r0, µ0) is the normalizing constant. The conjugate density on η is proportional to the NEF likelihood. Changing variables,

$$d\eta = \frac{d\eta}{d\mu}\,d\mu = \frac{d\mu}{V(\mu)}, \qquad \text{and likewise} \qquad \psi(\eta) = \int \psi'(\eta)\,d\eta = \int_{\mu_0}^{\mu} \frac{\tilde{\mu}}{V(\tilde{\mu})}\,d\tilde{\mu},$$

yields the density of µ,

$$g(\mu)\,d\mu = K \exp\left\{-r_0 \int_{\mu_0}^{\mu} \frac{(\tilde{\mu} - \mu_0)\,d\tilde{\mu}}{V(\tilde{\mu})}\right\} \frac{d\mu}{V(\mu)}. \tag{4.2}$$

When an NEF likelihood is immediately available as a function of µ, a conjugate density with respect to dη = dµ/V(µ) simply substitutes µ0 and r0 for ȳ and r, with K chosen so it integrates to 1. For example, starting with Ȳ ∼ (1/r)Bin(r, p), the conjugate (prior) distribution replaces ȳ with µ0 (or Y with r0µ0) and r with r0, and changing the measure from dη to dp/V(p), the density on p = µ becomes

$$K_{\mu_0, r_0}\; p^{r_0\mu_0}(1 - p)^{r_0(1-\mu_0)}\, \frac{dp}{p(1 - p)}. \tag{4.3}$$

We recognize that µ = p ∼ Beta(r0µ0, r0(1 - µ0)) and that K_{µ0,r0} = 1/B(r0µ0, r0(1 - µ0)) is an inverted beta function. Figure 1 exhibits the six (Pearson family) conjugates in blue rectangles directly below each NEF-QVF (follow the dotted arrows). No arrow is shown for the Normal, which is its own conjugate, but it is earmarked by the rectangle inside the Normal ellipse. The conjugate families on µ are: Gammas for Poissons, Inverted Gammas (reciprocals of Gamma random variables) for Gammas, Betas for Binomials, F-distributions for Negative Binomials, and Skewed-t distributions (Skates 1993; Esch 2003) for the NEF-CHS. The symmetric Skewed-t is Student's t, which includes the Cauchy distribution, t1. We renamed what was originally known as the Skew-t (Skates 1993) as the Skewed-t, since Skew-t has since been taken to mean a Skew-Normal over a Chi, which is different from our Skewed-t, Pearson's Type IV distribution.

4.2 NEF-QVF Conjugates as Pearson Families

Karl Pearson a century ago characterized all continuous distributions for which the derivative of the log density is a linear function divided by a quadratic. Distributions satisfying this criterion later became known as Pearson families. Remarkably, the six NEF-QVF conjugates derived above correspond to precisely all of Pearson's families. Pearson identified and labeled 12 distributions, Pearson Types I - XII, and all 12 are among the six named distributions in the blue rectangles of Fig. 1. The Normal is a Pearson distribution but unnumbered, Gamma is Pearson type III, Inverted Gamma is type V, Beta is type I, F is type VI, Skewed-t is type IV, and the rest of Pearson's distributions are special cases of these six. NEF-QVFs have quadratic V(µ), so with g(µ) as in (4.2),

$$\frac{d \log(g(\mu))}{d\mu} = \frac{r_0(\mu_0 - \mu) - V'(\mu)}{V(\mu)}. \tag{4.4}$$

These are ratios of linear functions (in µ) to quadratic functions, i.e. V(µ), which meets Pearson's condition exactly. To honor Pearson, we refer to these as "Pearson Conjugates" (PC), despite his having derived them as sampling distributions for continuous data, and not as conjugate prior distributions. Pearson recommended fitting his distributions to data by matching the first four sample moments, although this eventually fell out of favor, partly because higher moments fit with data are rather unstable. The means and variances of these Pearson families in the notation here are

$$\mu \sim PC\left[\mu_0,\ \frac{V(\mu_0)}{r_0 - v_2}\right] \tag{4.5}$$

(Morris 1982, Sec. 5). By (4.4), V(·) characterizes the Pearson family, so (4.5) is unambiguous. Remarkably, the conjugate distribution's VF in (4.5), where V(µ0) = v2µ0² + v1µ0 + v0, is the same VF as the NEF. Of course r0 > max(0, v2) is required for the variance to exist. The third PC cumulants are C3 = 2V(µ0)V'(µ0)/[(r0 - v2)(r0 - 2v2)], a cubic. This and all existing higher Pearson moments are expressible in terms of the VF (Morris 1982, Sec. 5).

Gamma distributions arose here initially as an NEF-QVF. Now they arise again, as NEF-QVF conjugates to the Poisson, and with a different parameterization. As an NEF-QVF, Ȳ ∼ (µ/r)Gam(r) and the variance is strictly quadratic in µ. If instead Ȳ ∼ (1/r)Pois(rµ), the conjugate density for µ becomes

$$K(\mu_0, r_0)\, e^{-r_0\mu}(r_0\mu)^{r_0\mu_0}\, \frac{d\mu}{\mu}. \tag{4.6}$$

The Gamma distribution has arisen again, with µ ∼ (1/r0)Gam(µ0r0), but now Var(µ) = µ0/r0 is not strictly quadratic in µ0.

Pearson's conjugate distributions can be used to generalize Laplace methods and maximum likelihood estimation (MLE), Morris (1988). As does the MLE, this "adjustment for density maximization" (ADM) evaluates two derivatives to derive better moment and distribution approximations.
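The moment statement (4.5) can be compared with textbook moments of two named conjugates. The sketch below assumes the Beta(r0µ0, r0(1 - µ0)) conjugate for the Binomial from (4.3) and the (1/r0)Gam(µ0r0) conjugate for the Poisson from (4.6); the function names, the numerical values, and the validity check on r0 (our reading of the existence condition stated after (4.5)) are illustrative assumptions.

```python
def pc_moments(mu0, r0, v0, v1, v2):
    """Mean and variance of the Pearson conjugate per (4.5)."""
    if r0 <= max(0.0, v2):
        # Assumed reading of the existence condition stated after (4.5).
        raise ValueError("variance requires r0 > max(0, v2)")
    return mu0, (v2 * mu0 ** 2 + v1 * mu0 + v0) / (r0 - v2)


def beta_moments(a, b):
    """Standard mean and variance of a Beta(a, b) distribution."""
    s = a + b
    return a / s, a * b / (s ** 2 * (s + 1.0))


def scaled_gamma_moments(shape, scale):
    """Mean and variance of scale * Gam(shape)."""
    return shape * scale, shape * scale ** 2


mu0, r0 = 0.3, 8.0                                   # illustrative prior settings
binom_pc = pc_moments(mu0, r0, 0.0, 1.0, -1.0)       # Binomial: V(mu) = mu - mu^2
beta = beta_moments(r0 * mu0, r0 * (1.0 - mu0))

lam0, s0 = 2.5, 4.0                                  # illustrative prior settings
pois_pc = pc_moments(lam0, s0, 0.0, 1.0, 0.0)        # Poisson: V(mu) = mu
gamma = scaled_gamma_moments(lam0 * s0, 1.0 / s0)
print(binom_pc, beta)
print(pois_pc, gamma)
```

In both cases the pair returned by (4.5) matches the familiar closed-form moments, with v2 = -1 producing the Beta denominator r0 + 1.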

4.3 Pearson Conjugates as Prior Distributions

Pearson conjugates serve as highly tractable prior distributions on an NEF mean for Bayesian analyses of NEF data because the resulting posterior takes the same form,

$$\mu \mid \bar{y}, \mu_0, r_0 \sim PC\left[\mu^*_y \equiv \frac{r_0\mu_0 + r\bar{y}}{r_0 + r},\ \frac{V(\mu^*_y)}{r + r_0 - v_2}\right]. \tag{4.7}$$

The PC here refers to the same family as that in (4.5), so the VF again agrees with that of the NEF. The conjugate prior also is convenient because the posterior mean, µ*_y, is linear in ȳ, a result that characterizes conjugates to all NEFs (Diaconis and Ylvisaker, 1979). The "shrinkage factor" B ≡ r0/(r0 + r) dictates the shrinkage of ȳ towards µ0, with µ*_y ≡ (1 - B)ȳ + Bµ0 being the posterior mean.

A useful fact, apparently unknown, is that Jeffreys' prior for an NEF mean, µ, is conjugate (in the Diaconis and Ylvisaker sense) if and only if the VF is quadratic. Thus the Jeffreys posterior is especially easy to work with if and only if it is a Pearson conjugate distribution. Conjugate priors are easy to work with and sufficiently flexible to allow a range of choices of the mean and variance, via choosing µ0 and r0 in (4.5). In fact, among all distributions with a pre-specified prior mean and variance, the Pearson conjugate provides minimax risk for squared error loss, so Pearson conjugates are the safest and most robust choice of prior (Jackson et al., 1970; Morris, 1983, Theorem 5.5). Walter and Hamedani (1991), Consonni and Veronese (1992), and Diaconis, Khare, and Saloff-Coste (2008) all have studied other aspects of these PC distributions from a unified perspective, as have Gutierrez-Pena and Smith (1997) in a multivariate setting.
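The posterior update in (4.7) reduces to a few lines of arithmetic. This is a minimal sketch under the notation of the text; the function name, the argument order, and the Beta-Binomial numbers are illustrative assumptions. The example uses a Beta(2, 6) prior (µ0 = 0.25, r0 = 8) with 6 successes in r = 12 Bernoulli trials, for which the standard Beta(8, 12) posterior is available as a cross-check.

```python
def pc_posterior(mu0, r0, ybar, r, v):
    """Posterior [mean, variance] from (4.7); v = (v0, v1, v2) defines V(mu)."""
    v0, v1, v2 = v
    B = r0 / (r0 + r)                        # shrinkage factor
    post_mean = (1.0 - B) * ybar + B * mu0   # linear in ybar
    V = v2 * post_mean ** 2 + v1 * post_mean + v0
    return post_mean, V / (r + r0 - v2)


# Binomial NEF: V(mu) = mu - mu^2, so (v0, v1, v2) = (0, 1, -1).
post_mean, post_var = pc_posterior(0.25, 8.0, 0.5, 12.0, (0.0, 1.0, -1.0))

# Cross-check against the Beta(8, 12) posterior from the usual conjugate update.
a, b = 8.0, 12.0
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1.0))
print(post_mean, beta_mean)
print(post_var, beta_var)
```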

5 CONJUGATE MIXTURES

5.1 NEF-QVF Pearson Conjugate Mixtures

Conjugate mixtures are shown as green hexagons in Figure 1, each at the lower right (southeast) of its associated NEF-QVF; follow the green dotted and dashed arrows. A Pearson conjugate mixture refers to the marginal distribution of Ȳ, when Ȳ|µ ∼ NEF[µ, V(µ)/r] and µ|µ0, r0 ∼ PC[µ0, V(µ0)/(r0 - v2)], as in (4.5) with r0 > v2. The marginal means and variances of Ȳ follow directly from Adam's Law (E(Y) = E[E(Y|X)]) and from Eve's Law (Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]),

$$E(\bar{Y}) = E[E(\bar{Y} \mid \mu)] = E[\mu] = \mu_0, \tag{5.1}$$

and

$$\mathrm{Var}(\bar{Y}) = E[\mathrm{Var}(\bar{Y} \mid \mu)] + \mathrm{Var}[E(\bar{Y} \mid \mu)] \tag{5.2}$$
$$= E[V(\mu)]/r + \mathrm{Var}[\mu] \tag{5.3}$$
$$= E(v_2\mu^2 + v_1\mu + v_0)/r + \mathrm{Var}[\mu] \tag{5.4}$$
$$= [v_2 \mathrm{Var}(\mu) + V(E\mu)]/r + \mathrm{Var}[\mu] \tag{5.5}$$
$$= \frac{r + r_0}{r(r_0 - v_2)}\, V(\mu_0). \tag{5.6}$$

Thus, all six NEF-QVFs mixed with their Pearson conjugates have Pearson mixtures (PM) as marginal distributions,

$$\bar{Y} \sim PM\left[\mu_0,\ \frac{r + r_0}{r(r_0 - v_2)}\, V(\mu_0)\right]. \tag{5.7}$$

Once again, as do the Pearson conjugate and the posterior distribution, the PM inherits the VF of the originating NEF. These are shown in Fig. 1 as green hexagons. Normal conjugate mixtures are Normals, but with different parameterizations (arrow suppressed), Negative Binomials are Gamma mixtures of Poissons, F distributions are mixtures of Gammas, via their Inverted Gamma conjugates, and Polya I and Polya II distributions (explained next) are mixtures of Binomials and Negative Binomials using their respective conjugates, Beta and F. Mixing NEF-CHS distributions with µ ∼ Skewed-t yields continuous and unbounded five-parameter distributions, labeled "Skewed-t - CHS" in Fig. 1. Of course each mixture distribution possesses the same support as the NEF from which it arose.
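The mixture variance (5.6) can be checked exactly in the Binomial case, where the Beta mixture of Binomials has the well-known Beta-Binomial PMF. The sketch below assumes a Beta(r0µ0, r0(1 - µ0)) conjugate and uses illustrative values of r, µ0, and r0; the function name is hypothetical.

```python
import math


def beta_binomial_pmf(y, r, a, b):
    """PMF of a Bin(r, p) count mixed over p ~ Beta(a, b), via log-gamma terms."""
    log_p = (math.lgamma(r + 1) - math.lgamma(y + 1) - math.lgamma(r - y + 1)
             + math.lgamma(y + a) + math.lgamma(r - y + b) - math.lgamma(r + a + b)
             + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_p)


r, mu0, r0 = 10, 0.3, 8.0                    # illustrative values (assumption)
a, b = r0 * mu0, r0 * (1.0 - mu0)
pmf = [beta_binomial_pmf(y, r, a, b) for y in range(r + 1)]

# Exact moments of the average Ybar = Y / r under the mixture.
mean = sum((y / r) * p for y, p in enumerate(pmf))
var = sum((y / r - mean) ** 2 * p for y, p in enumerate(pmf))

v2 = -1.0                                    # Binomial: V(mu) = mu - mu^2
formula = (r + r0) / (r * (r0 - v2)) * mu0 * (1.0 - mu0)
print(mean, mu0)
print(var, formula)
```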

5.2 Polya Distributions

Polya's urn schemes give rise to distributions that arise as NEF Pearson conjugate mixtures with integer parameters. We adopt the labels I and II here to reflect two types of sampling from Polya's (simplest) urn scheme (Feller 1950): sampling from a binary urn, initially with B blue and W white balls, with "double replacement", so that after each draw the drawn ball is replaced together with another ball of the same color and the urn size increases after each draw. Type I sampling stops after a fixed number r of balls are drawn, while type II sampling continues until r blue balls have been drawn. The random variable in each case is the number of white balls drawn, which for Polya I has a bounded distribution on the integers 0, ..., r, while Polya II values are on the nonnegative integers. Bin(r, p) and NBin(r, p) distributions mixed with p ∼ Beta(W, B) (so for the Negative Binomial, µ = p/(1 - p) ∼ (W/B) F_{2W,2B}) produce Polya I and Polya II distributions, respectively. More generally, W and B need not be integers. A new population parameter N = W + B emerges to produce the 5-parameter distributions (location and scale are the other two) shown in column 5 of Fig. 1. As N → ∞, Var(p) → 0, making each Polya limit back to its NEF.

6 CONDITIONAL DISTRIBUTIONS

Let Ȳi ∼ NEF-QVF[µ, V(µ)/ri] independently, i = 1, 2. Denote the UMVUE of µ as µ̂ ≡ (r1Ȳ1 + r2Ȳ2)/(r1 + r2) ∼ NEF-QVF[µ, V(µ)/(r1 + r2)], a complete sufficient statistic. Because both Ȳ1 and µ̂ are unbiased estimates of µ, E(Ȳ1|µ̂) also must be an unbiased estimate of µ, and by completeness E(Ȳ1|µ̂) = µ̂. By Eve's Law, E(Var(Ȳ1|µ̂)) = Var(Ȳ1) - Var(E(Ȳ1|µ̂)) = Var(Ȳ1) - Var(µ̂) = V(µ)r2/[r1(r1 + r2)]. Thus [r1(r1 + r2)/r2]Var(Ȳ1|µ̂), a function of µ̂, is an unbiased estimate of V(µ). The UMVUE of V(µ) is easily checked to be V(µ̂)(r1 + r2)/(v2 + r1 + r2). By completeness, these two unbiased estimates must be the same function of µ̂. This determines Var(Ȳ1|µ̂) and gives

$$\bar{Y}_1 \mid \hat{\mu} \sim \left[\hat{\mu},\ \frac{r_2}{v_2 + r_1 + r_2}\,\frac{V(\hat{\mu})}{r_1}\right], \tag{6.1}$$

as in Morris (1983, Sec. 4). Once again, the conditional distribution has inherited the same VF, V(·). Figure 1 uses dashed arrows to locate these conditional distributions, in the purple octagons at the upper right of each NEF-QVF. The conditional distributions for Normals are Normals, conditional Poissons are Binomials, conditional Gammas are Betas, conditional Binomials are Hypergeometrics (e.g. for Fisher's Exact Test), and conditional Negative Binomials are Negative Hypergeometrics. Formula (6.1) provides the first two moments for the conditional distribution of the NEF-CHS, a distribution not yet named or fully investigated.

In Figure 1, the Polya I and Negative Hypergeometric distributions coincide under appropriate re-parameterizations. To see this, consider Polya's urn schemes once again. Let X be the number of white balls drawn from an urn starting with W white and B blue balls, drawing until r blue balls appear, so X ∼ NegHG(r, W/(W + B)). Alternatively, consider an urn with r white and B - r + 1 blue balls initially, drawing W balls from the urn. With Y as the number of white balls drawn (with double replacement), Y ∼ Polya I ∼ X (i.e. Pr(X = a) = Pr(Y = a) for all a).

The discrete distributions in columns 4-5 of Figure 1 all arise from urn schemes with various stopping and replacement rules. For stopping rule I [II], sampling with replacement yields Binomials [Negative Binomials], which are NEF-QVFs. Sampling with double replacement yields Polya I [II] as conjugate mixtures, and sampling without replacement yields Hypergeometrics [Negative Hypergeometrics] as conditional distributions. When Hypergeometrics and Negative Hypergeometrics arise as conditional distributions, r1 and r2 (the convolution parameters of Y1 and Y2) correspond to the initial number of each color in the urn. The population parameter r1 + r2 then arises so both of these distributions (with location and scale parameters) have 5 parameters.
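For the Binomial case, (6.1) can be compared with exact Hypergeometric moments. The following sketch assumes Y1 ∼ Bin(r1, p) and Y2 ∼ Bin(r2, p) independent, conditions on Y1 + Y2 = n, and works with the average Y1/r1; the values of r1, r2, and n and the function name are illustrative assumptions.

```python
import math


def hypergeom_pmf(k, r1, r2, n):
    """P(Y1 = k | Y1 + Y2 = n) for independent Bin(r1, p) and Bin(r2, p)."""
    return math.comb(r1, k) * math.comb(r2, n - k) / math.comb(r1 + r2, n)


r1, r2, n = 7, 5, 6                           # illustrative values (assumption)
ks = range(max(0, n - r2), min(r1, n) + 1)
pmf = {k: hypergeom_pmf(k, r1, r2, n) for k in ks}

# Exact conditional moments of the average Y1 / r1.
mean = sum((k / r1) * p for k, p in pmf.items())
var = sum((k / r1 - mean) ** 2 * p for k, p in pmf.items())

# Moments predicted by (6.1) with V(mu) = mu - mu^2, i.e. v2 = -1.
mu_hat = n / (r1 + r2)
v2 = -1.0
formula = r2 / (v2 + r1 + r2) * mu_hat * (1.0 - mu_hat) / r1
print(mean, mu_hat)
print(var, formula)
```

Note that p does not appear anywhere in the computation, reflecting that the conditional distribution given the sufficient statistic is free of the generation parameter.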

7

LIMITS IN DISTRIBUTION

Limits of distributions are found with left-directed solid arrows in Figure 1. Limits always lead to distributions and approximations with fewer parameters. Normal, Poisson, and Gamma NEFs arise as limits of Binomial, Negative Binomial, and NEF-CHS families. The latter three NEFs, all with four parameters, cannot be limits of each other, and they comprise the three distinct fundamental NEF-QVFs. [Note: To reduce clutter, Figure 1 avoids showing transitive limits. E.g. Negative Binomials have Poisson limits, and Poissons have Normal limits, so Negative Binomials also have Normal limits, but that isn't shown with a separate arrow.]

Normal distributions provide the simplest non-trivial (i.e. with a positive variance) NEF limits, hence their role in the central limit theorem. Only the law of large numbers (LLN), with (trivial) constant limits, takes a simpler form in Fig. 1 and is more widely achievable.

7.1

NEF Convergence via Variance Functions

Because a variance function characterizes an NEF family, convergence of NEF variance functions implies convergence of distributions. For a Bin(r, p), when r → ∞ and p → 0 with mean λ = rp held constant, the variance rp(1 − p) = λ(1 − λ/r) asymptotes to λ. Since this limiting variance equals the mean for this NEF, the asymptotic distribution is Poisson. Likewise, an NBin(r, p) with mean λ = rp/(1 − p) held constant as r → ∞ and p → 0 has variance rp/(1 − p)² = λ(1 + λ/r) → λ. Hence, Binomials and Negative Binomials have Poisson limits in distribution.

Negative Binomials and NEF-CHS distributions have Gamma limits when r stays fixed and µ → ∞. The mean of NBin(r, p)/r is µ = p/(1 − p), and its variance is (µ² + µ)/r, while the NEF-CHS has mean µ and variance (µ² + 1)/r. For large µ, both of these VFs approximate (in ratio) µ²/r, the VF of the Gamma family. Since µ → ∞, a more formally rigorous argument would require the introduction of a scale parameter, with both random variables divided by µ (justifying a continuous approximation to the Negative Binomial). Solid leftward arrows in Figure 1 reveal these limiting approximations.
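The Poisson limits above can be illustrated numerically. This is a rough sketch under our own assumptions, not a computation from the paper: we write lam for the fixed mean, take NBin(r, p) to count successes (probability p each) before the r-th failure so that its mean is rp/(1 − p), and use the largest pointwise gap between probability mass functions as an informal measure of closeness.

```python
from math import comb, exp, factorial


def binomial_pmf(k, r, p):
    if k > r:
        return 0.0
    return comb(r, k) * p**k * (1 - p) ** (r - k)


def neg_binomial_pmf(k, r, p):
    """Number of successes (probability p each) before the r-th failure."""
    return comb(k + r - 1, k) * p**k * (1 - p) ** r


def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)


def max_pmf_gap(pmf, lam, k_max=40):
    """Largest pointwise difference from the Poisson(lam) pmf over k = 0..k_max."""
    return max(abs(pmf(k) - poisson_pmf(k, lam)) for k in range(k_max + 1))


lam = 3.0  # hypothetical fixed mean
rows = []
for r in (10, 100, 1000):
    p_bin = lam / r           # Binomial: mean r*p = lam
    p_nb = lam / (r + lam)    # Negative Binomial: mean r*p/(1-p) = lam
    var_bin = r * p_bin * (1 - p_bin)       # equals lam * (1 - lam/r)
    var_nb = r * p_nb / (1 - p_nb) ** 2     # equals lam * (1 + lam/r)
    gap_bin = max_pmf_gap(lambda k: binomial_pmf(k, r, p_bin), lam)
    gap_nb = max_pmf_gap(lambda k: neg_binomial_pmf(k, r, p_nb), lam)
    rows.append((r, var_bin, var_nb, gap_bin, gap_nb))
    print(f"r={r:5d}  var_bin={var_bin:.4f}  var_nb={var_nb:.4f}  "
          f"gap_bin={gap_bin:.5f}  gap_nb={gap_nb:.5f}")
```

As r grows, both variances approach lam from opposite sides and both gaps shrink, which is the variance-function argument seen from the side of the distributions themselves.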

7.2

Convergence of NEF-QVF Relatives

Each conditional distribution (Sec 6, octagons in Figure 1) has its NEF-QVF (Sec 3, the Figure 1 ellipses) as a limit. Let Y1 and Y2 ∼ NEF-QVF, with convolution parameters r1 and r2 respectively. As r2 → ∞ with r1 fixed, Y2 dominates Y1 + Y2, so Y1 | Y1 + Y2 ≈ Y1 | Y2 ∼ Y1 ∼ NEF-QVF. E.g. Hypergeometrics have Binomial limits.

As the convolution parameter for the conjugate prior (r0) goes to infinity, the conjugate mixture distribution (Sec 5, hexagons on Figure 1) limits back to the NEF-QVF from which it came. If r0 → ∞, we essentially know µ exactly, so Y ≈ Y | µ ∼ NEF-QVF. E.g. Polya I (Beta-Binomial) can limit to Binomial. Alternatively, as the convolution parameter for the NEF-QVF (r) goes to infinity, the conjugate mixture distribution limits to the mean parameter's Pearson conjugate distribution. As r → ∞, then Ȳ → µ by the LLN, so Ȳ ≈ µ ∼ PC (Pearson Conjugate). E.g. Polya I can limit to Beta.

The remaining limit arrows on Figure 1 may be explained by the "NEF-QVF Four Color Problem": If one red circle (NEF-QVF) limits to another red circle, then each relative (blue square, green hexagon, and purple octagon) limits to the corresponding color relative of the limiting NEF-QVF.
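The Hypergeometric-to-Binomial limit mentioned above can be sketched numerically. The example is ours rather than the paper's: r1 = 5 and a conditional mean of 0.4 are arbitrary choices, and total variation distance is used as one convenient measure of closeness. The sketch also compares the conditional variance with the Binomial variance. Their ratio is the classical finite-population factor r2/(r1 + r2 − 1), which is how we read (6.1) in the Binomial case with v2 = −1.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(k, r1, r2, n):
    """Pr(Y1 = k | Y1 + Y2 = n) for independent Y1 ~ Bin(r1, p), Y2 ~ Bin(r2, p)."""
    if k < 0 or k > r1 or n - k < 0 or n - k > r2:
        return 0.0
    return float(Fraction(comb(r1, k) * comb(r2, n - k), comb(r1 + r2, n)))


def binomial_pmf(k, r, p):
    return comb(r, k) * p**k * (1 - p) ** (r - k)


r1 = 5  # hypothetical fixed convolution parameter of Y1
results = []
for total in (10, 50, 500, 5000):   # total = r1 + r2
    r2 = total - r1
    n = 2 * total // 5              # keeps mu_hat = n / (r1 + r2) = 0.4
    mu_hat = n / total
    hg = [hypergeometric_pmf(k, r1, r2, n) for k in range(r1 + 1)]
    bn = [binomial_pmf(k, r1, mu_hat) for k in range(r1 + 1)]
    tv = 0.5 * sum(abs(h - b) for h, b in zip(hg, bn))
    mean = sum(k * h for k, h in enumerate(hg))
    var = sum((k - mean) ** 2 * h for k, h in enumerate(hg))
    ratio = var / (r1 * mu_hat * (1 - mu_hat))   # conditional vs. Binomial variance
    factor = r2 / (r1 + r2 - 1)
    results.append((r2, tv, ratio, factor))
    print(f"r2={r2:5d}  TV={tv:.5f}  var ratio={ratio:.5f}  r2/(r1+r2-1)={factor:.5f}")
```

The total variation distance shrinks as r2 grows while the variance ratio tracks r2/(r1 + r2 − 1) and tends to one, so the conditional distribution recovers the Binomial it came from.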

8

CONCLUSION

This paper has reviewed unifications of the six univariate one-parameter NEF-QVF families. Additional probabilistic results and proofs that unify NEF-QVFs are in Morris (1982, 1983), plus a short survey in Morris (1985). Probabilistic results (Morris 1982) include infinite divisibility, cumulant and moment formulae, orthogonal polynomials (including those of Hermite, Poisson-Charlier, and Laguerre), and large deviation bounds. Statistical results (Morris 1983) include unbiased estimation, Bhattacharyya bounds (via orthogonal polynomials), and more on Pearson conjugate distributions.

Infinitely many non-QVF univariate NEFs exist, but almost none are named. The most beautiful non-quadratic NEF has a cubic monomial VF, V(µ) ∝ µ³. This is the Inverse Gaussian distribution, which we label "NEF7". We also propose adopting the acronym "TWIG" for these NEF7 distributions to indicate Tweedie, Wald, and Inverse Gaussian. Wald showed that TWIG arises as the distribution of a waiting time for Brownian Motion with drift (Seshadri 1993). "TWIG" also recognizes M. C. K. Tweedie's (1957) discovery of TWIG's remarkable statistical sampling properties: the pair of complete sufficient statistics for the mean and the convolution parameter are independent, and the latter (properly scaled) has a χ²_{n−1} distribution. Letac and Mora (1990) identified all possible NEFs with proper cubic VFs, showing that there are exactly six. See also Letac (1992).

Another named univariate NEF is the Von Mises family of distributions on the unit circle (Brown 1986). The Multivariate Normal is the only multivariate NEF with a fully-parameterized covariance matrix. Patil (1985) and Brown (1986) use "Linear Exponential Family" (LEF) to refer to multivariate NEFs. The Multinomial distribution is a multivariate NEF (LEF) with a quadratic covariance matrix, but with a very restrictive parameterization. Also see Bar-Lev et al. (1994) on multivariate NEFs.

NEFs lead naturally to quasi-likelihood methods, as pioneered by Wedderburn (1974), who originated the term "variance function". The VF is central to quasi-likelihood methods and to generalized linear models (McCullagh and Nelder 1989). "Exponential tilting" is a recent synonym for "generation" in exponential families, used when the focus is on devising accurate tail approximations to distributions (Davison 2003).

The distributions and relationships in Figure 1 form the core for probability and statistics courses because NEF-QVFs arise so widely as sampling distributions for real data. We offer Figure 1 here, believing that it will aid and deepen insights and understanding for students, faculty, and all practitioners of probability and statistics.

References

[1] Bar-Lev, S.; Bshouty, D.; Enis, P.; Letac, G.; Lu, I-L.; Richards, D. (1994). "The diagonal multivariate natural exponential families and their classification," J. Theoret. Probab., 7, 883-929.
[2] Barndorff-Nielsen, O. (1978), Information and Exponential Families in Statistical Theory, New York:Wiley.
[3] Brown, L. D. (1986), Fundamentals of Statistical Exponential Families, Hayward, California: Institute of Mathematical Statistics, Lecture Notes Monograph Series, 9.
[4] Consonni, G. and Veronese, P. (1992). "Conjugate priors for exponential families having quadratic variance functions," Journal of the American Statistical Association, 87, 1123-1127.
[5] Davison, A. C. (2003). Statistical Models, Cambridge, UK:Cambridge University Press.
[6] Diaconis, P., and Ylvisaker, D. (1979), "Conjugate Priors for Exponential Families," The Annals of Statistics, 7, 269-281.
[7] Diaconis, P., Khare, K., Saloff-Coste, L. (2008). "Gibbs Sampling, Exponential Families and Orthogonal Polynomials," To appear in Statistical Science.
[8] Esch, D. (2003). "Applications and Extensions of Three Statistical Models," Ph.D. dissertation, Harvard University, May 22, 2003.
[9] Evans, M., Hastings, N., and Peacock, B. (1993). Statistical Distributions (2nd ed), New York:Wiley.
[10] Feller, W. (1950). An Introduction to Probability Theory and its Applications (Vol 1), New York:Wiley.
[11] Gutierrez-Pena, E. and Smith, A. (1997). "Exponential and Bayesian conjugate families: Review and extensions," Test, 6, 1-90.
[12] Harkness, W. L., and Harkness, M. L. (1968), "Generalized Hyperbolic Secant Distributions," Journal of the American Statistical Association, 63, 329-337.
[13] Jackson, D. A., O'Donovan, T. M., Zimmer, W. J., and Deely, J. J. (1970), "Minimax Estimators in the Exponential Family," Biometrika, 57, 439-443.
[14] Johnson, N. L., and Kotz, S. (1970), Continuous Univariate Distributions - 2, Boston:Houghton-Mifflin.
[15] Kendall, M., Stuart, A., and Ord, K. (1987). Kendall's Advanced Theory of Statistics: Volume 1 Distribution Theory (5th ed.), New York:Oxford.
[16] Leemis, L. M. and McQueston, J. T. (2008). "Univariate Distribution Relationships," The American Statistician, 62, 1, 45-53.
[17] Letac, G. (1992). "Lectures on Natural Exponential Families and Their Variance Functions," Monografias de matematica, 50, I.M.P.A., Rio de Janeiro.
[18] Letac, G. and Mora, M. (1990). "Natural Real Exponential Families with Cubic Variance Functions," The Annals of Statistics, 18, 1, 1-37.
[19] Manoukian, E. B. and Nadeau, P. (1988). "A Note on the Hyperbolic-Secant Distribution," The American Statistician, 42, 1, 77-79.
[20] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed), Boca Raton:Chapman & Hall.
[21] Morris, C. N. (1982), "Natural Exponential Families with Quadratic Variance Functions," The Annals of Statistics, 10, 65-80.
[22] Morris, C. N. (1983), "Natural Exponential Families with Quadratic Variance Functions: Statistical Theory," The Annals of Statistics, 11, 515-529.
[23] Morris, C. N. (1985), "Natural Exponential Families," Encyclopedia of Statistical Sciences (Vol 6), ed. Kotz and Johnson, New York:Wiley, 157-159.
[24] Morris, C. N. (1988), "Approximating Posterior Distributions and Posterior Moments," Bayesian Statistics III, ed. J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, Oxford University Press, 327-344.
[25] Patil, G. P. (1985). "Linear Exponential Family," Encyclopedia of Statistical Sciences (Vol 5), ed. Kotz and Johnson, New York:Wiley, 22-24.
[26] Seshadri, V. (1993). The Inverse Gaussian Distribution: A Case Study in Exponential Families, Oxford University Press.
[27] Skates, S. (1993). "On Secant Approximations to Cumulative Distribution Functions," Biometrika, 80, 1, 223-235.
[28] Tweedie, M. C. K. (1957). "Statistical Properties of Inverse Gaussian Distribution I, II," The Annals of Mathematical Statistics, 28, 2, 362-377; 28, 3, 696-705.
[29] Walter, G. G. and Hamedani, G. G. (1991). "Bayes Empirical Bayes Estimation for Natural Exponential Families with Quadratic Variance Functions," The Annals of Statistics, 19, 3, 1191-1224.
[30] Wedderburn, R. M. W. (1974), "Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method," Biometrika, 61, 439-447.
