Read poster_2011_06_27.pdf text version

Beta processes, stick-breaking, and power laws

Tamara Broderick

UC Berkeley [email protected]

Michael I. Jordan

UC Berkeley [email protected]

Jim Pitman

UC Berkeley [email protected]

8th Workshop on Bayesian Nonparametrics


The beta-Bernoulli process provides a Bayesian nonparametric prior for models involving collections of binary-valued features. A draw from the beta process provides an infinite collection of probabilities in the unit interval, and a draw from the Bernoulli process turns these into binary-valued features. Recent work has shown how to derive stick-breaking representations for the beta process, by analogy to Sethuraman's derivation of a stick-breaking representation for the Dirichlet process. We show how to derive one such stick-breaking representation directly from the characterization of the beta process as a completely random measure. We show that this approach motivates a three-parameter generalization of the beta process, and we study the power laws that can be obtained from this generalized beta process.

Power Laws in Clustering and Featural Models

Clustering. Given a number N of draws from a discrete random probability measure, let (N1, N2, . . .) denote the sequence of counts associated with the unique values obtained among the N draws, where we view these unique values as "clusters." Let the number of clusters drawn exactly j times be KN,j , and the total number of clusters be KN .

q1 q2 q3 q4 q5 qi


Beta Process Power Laws

Types I and II. N an integer makes finding power laws in KN , KN,j difficult. Poissonizing KN and KN,j replaces them with continuous versions K(t) and Kj (t) such that a.s. K(N ) KN and a.s. Kj (N ) KN,j .

KN,j :=


1(Ni = j), KN =


1(Ni > 0).

There are two types of power-law behavior that a clustering model might exhibit. First, there is what we call a Type I power law, reminiscent of Heaps' law:








Type I Power Law KN bN a as N for some constants b > 0, a (0, 1).

Next, there is a law reminiscent of Zipf's law:

a(j-a) bN a j!


The first five sets of points show Poisson processes with rates q1, . . . , q5. The bottom set of points collects the points above and is a Poisson process with rate i qi. Kj (t) will be the number of these Poisson processes with j points in the interval [0, t], and K(t) is the sum over j. In this example, K(1) = 2, K(4) = 5, and K2(4) = 1. Lemma 1. For feature frequencies from a Poisson process with infinite rate measure, as N , |EKN - EK(N )| 0, |EKN,j - EKj (N )| 0. Lemma 2. For feature frequencies from a Poisson process, as N a.s. a.s. , KN EKN , k<j KN,k k<j EKN,j . Lemma 3. For feature frequencies from a Poisson process with mean measure , l a slowly varying function, and a (0, 1): a 1-a x (dx) x l(1/x) (x 0) implies 1-a 0 al(t), EK (t) a(j - a) tal(t) (t ). EK(t) (1 - a)t j j! The lemmas above combine to give the following proposition when applied to the three-parameter beta process.


Beta and Bernoulli Processes

(a) Beta process rate A completely random measure (CRM) is constructed from a Poisson process = {(i, Ui)}i as i=1 Uii . The beta process [4], B BP(, B0), is the CRM with rate measure BP(d, du) = u-1(1 - u)-1 du · B0(d), u [0, 1], When B1 is discrete with atoms in (0, 1], i.e. B1 = Uii and i=1 Ui (0, 1], the Bernoulli process, Y BeP (B1) is of the form Y = Wii i=1 with Wi Bern(Ui). Form a matrix Z {0, 1}N ×K as follows. Draw B BP(, B0) and then Yn|N BeP (B). In n=1 this case, Y places unit masses at some finite subset {k }K of the k=1 atoms of B. Write Zn,k = Yn(k ). Then we say Z BP-BeP(N, , B0) is drawn from a beta-Bernoulli process.

iid indep



N . It can be shown that these two power laws imply each other and the constants are the same in both cases [1]. Treating N as constant in the last equation implies what we call a Type II power law; here, c = abN a.

Type II Power Law KN,j cj -1-a as j for some constants c > 0, a (0, 1).

Featural. In the featural case, we use the same notation, but now we note that each draw can place non-zero mass on more than a single atom, and thus the counts Ni no longer sum to N in general. We can still talk about Type I and II power laws with the same interpretation as above. In the featural case, however, it is also possible to consider a third type of power law. If we let kn denote the number of features present in the nth draw, we say that kn shows Type III power law behavior if


(b) Beta process draw


Proposition 2 For the beta process in Eq. (2) and C := KN CN

Type III.

a.s. d a.s. d(j-d) , KN,j j!(1-d) CN d (1+) d(+d) ,


(N ).


(c) Bernoulli draws

50 40 30 20 10 0

Type III Power Law P(kn > M ) rM -s as M for some constants r, s > 0.


Proposition 3 For feature frequencies such that qi 0 and i qi < , Zi independent Bernoulli random variables with success probability qi, and large enough N , P[ i Zi > N ] < eN - N N -N .


(d) Feature matrix

Beta Process Stick-breaking

In the clustering case, the Dirichlet process exhibits neither Type I nor Type II power law behavior. However, the Pitman-Yor process yields both kinds of power law. As the one-parameter Dirichlet process generalizes to the two-parameter Pitman-Yor process with an extra parameter in the stick fractions Beta(1 - d, + id), so too we might consider generalizing the stick-breaking representation of the beta process in [2] with an extra parameter: B= Ci Pois(),

iid (i) i-1 (l) Ci i=1 j=1 Vi,j l=1 (1 - Vi,j )i,j , iid 1 (l) indep Vi,j Beta(1 - d, + id), i,j B0.

Experimental Results

To illustrate our results in the context of a concrete application, we study a discrete factor analysis model previously considered by [2]. The model is of the form X = Z + E, where X RN ×P is the data and E RN ×P is an error matrix. The matrix RK×P is a matrix of factors, and Z RN ×K is a binary matrix of factor loadings. The K rows of comprise a potentially infinite collection of factors. The matrix Z is drawn from a beta-Bernoulli process of size N . (a) Number of factors

Two-parameter model 50 40 50 40 Autocorrelation Three-parameter model


Dirichlet Process Stick-breaking

1 V1

To derive the three-parameter stick-breaking representation, we start by generalizing the beta process to a three-parameter version. (1 + ) BP(d, du) = u-1-d(1 - u)+d-1 du B0(d). (2) (1 - d)( + d)

(c) # factors ACF

Two-parameter model 1 K 0.5

30 K

30 K


V2(1 - V1) 1 - V1

Proposition 1 B is constructed according to the stick-breaking process in Eq. (1) if and only if B BP(, d, B0).



-0.5 10 10 -1 0 500 1000 1500 Iteration number 2000 1 0 50 Lag 100



500 1000 1500 Iteration number



Three-parameter model K d 0.5 Autocorrelation


Stick-breaking is the process of recursively breaking off random fractions of the unit interval (far left). First, we break off a random fraction V1; the remaining stick has length 1 - V1 (middle left). Next, we break off a random fraction V2 of the remaining stick, i.e. a fragment of size V2(1 - V1); the remaining stick has length (1 - V1)(1 - V2). This process generates stick fragments V1, V2(1 - V1), . . . , Vi j<i(1 - Vj ), . . .. These fragments form a random partition of the unit interval (far right). [3] showed that the Dirichlet process (DP) arises from the special case in which Vi are independent draws from the Beta(1, ) distribution. So we can write a draw G DP(, G0) as:

i-1 G = l=1 (1 - Vl ) i iid iid Vi Beta(1, ), i G0, i=1 Vi

(b) Stick parameters Proof. As a Poisson number of conditionally independent and identically distributed points, i-1 i-1 (l) i,1, V (i) (1 - V (l)) , . . . , i,C , V (i) (1 - Vi,C ) , Pi := i,1 i,1 i,Ci i i

l=1 l=1

Two-parameter model 100 Three-parameter model 100 Three-parameter model 1




d 0.5

-0.5 0 0 0


1000 2000 Iteration number


1000 2000 Iteration number


1000 2000 Iteration number



50 Lag


Pi is a Poisson point process. So P := process.

i=1 Pi is a Poisson point

(d) Top features

Two-parameter model

~ ~ Let be the rate measure of P . Then both (A × A) = B0(A)µ(A) ~ ~ and BP(A × A) = B0(A)µBP(A) for some measures µ and µBP. So it is enough to show µ = µBP. ~ Taking a size-biased pick U from the vector of stick-breaking fre~ ~ quencies, we can find both P(U du) = uµ(du) and U Beta(1 - d, + d). Therefore, ~ µ(du) = u-1P(U du) = u-1(1 + ) (1 - d)( + d) u-d(1 - u)+d-1 = µBP(du)

Three-parameter model


[1] A. Gnedin, B. Hansen, and J. Pitman. Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws. Probability Surveys, 4:146­171, 2007. [2] J. Paisley, A. Zaas, C. W. Woods, G. S. Ginsburg, and L. Carin. A stick-breaking construction of the beta process. In International Conference on Machine Learning, Haifa, Israel, 2010. [3] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639­650, 1994. [4] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 2007.


1 pages

Report File (DMCA)

Our content is added by our users. We aim to remove reported files within 1 working day. Please use this link to notify us:

Report this file as copyright or inappropriate


Notice: fwrite(): send of 199 bytes failed with errno=104 Connection reset by peer in /home/ on line 531