
An attempt to intuitively explain HDP, and why it makes sense to use HDPs for joint mixture modeling of multiple related datasets.

Plate model of DP (left) and HDP (right)

Author: Surya

Published: April 15, 2019

Updated: Not revised yet
1 Introduction

Suppose we consider that every document discusses a set of topics, and our task is to learn those topics. We do not know a priori how many topics a document contains. A natural approach to this problem is non-parametric Bayesian modeling, wherein we use a Dirichlet Process (DP) prior, which places probability mass on infinitely many topics a priori. Using the document data, we can then pick out the finite set of topics we need a posteriori.

Now, it is natural to expect that various documents share topics. For example, a document on Beta Processes and a document on Gaussian Processes might share a topic like Stochastic. Learning topics for each document separately denies us the opportunity to share data across documents: we could have learned the topic Stochastic better by learning jointly from both documents. Note that we do not want to club both documents into one, since that destroys the individual document structure. The statistical tool that enables joint learning while preserving the per-document structure is hierarchical modeling.

Let's take 2 documents and place the priors G_1, G_2 \sim DP(\alpha_0, G_0), where G_1 = \sum_{k=1}^{\infty}\pi_{1k}\delta_{\phi_{1k}} and G_2 = \sum_{k=1}^{\infty}\pi_{2k}\delta_{\phi_{2k}}. Since G_0 is continuous, none of the \phi_{1k}'s and \phi_{2k}'s coincide (almost surely), so no topics are shared across the two documents. We therefore need a discrete G_0 with broad support. The paper Hierarchical Dirichlet Processes proposes a straightforward solution: draw G_0 itself from another Dirichlet process. An example generative story using the HDP is given below, wherein H is a Normal-Wishart distribution.

\begin{aligned} G_0 \mid \gamma, H &\sim DP(\gamma, H) \quad \quad \quad \quad \quad \quad G_0 = \sum_{k=1}^{\infty}\beta_{k}\delta_{\phi_k} \\ G_j \mid \alpha_{0}, G_0 &\sim DP(\alpha_0, G_0) \quad \forall \quad j \quad \quad G_j = \sum_{k=1}^{\infty}\pi_{jk}\delta_{\phi_{jk}}\\ \theta_{ji} \mid G_j &\sim G_j \\ x_{ji} \mid \theta_{ji} &\sim N(x_{ji} \mid \mu_{ji}, \Sigma_{ji}) \\ \end{aligned}
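To make the need for a discrete G_0 concrete, here is a minimal numpy sketch (not from the paper) contrasting atoms drawn from a continuous H with atoms drawn from a discrete G_0; the truncation level and variable names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1000  # number of atoms drawn per document (illustrative truncation)

# Atoms drawn directly from a continuous base measure H = N(0, 1):
# the two documents share no atoms (ties have probability zero).
phi_1 = rng.normal(size=K)
phi_2 = rng.normal(size=K)
print(np.intersect1d(phi_1, phi_2).size)   # 0 shared atoms

# Atoms drawn from a discrete G_0 (here, 20 fixed atoms with Dirichlet weights):
# the same atoms reappear in both documents, so topics are shared.
atoms = rng.normal(size=20)
weights = rng.dirichlet(np.ones(20))
phi_1 = rng.choice(atoms, size=K, p=weights)
phi_2 = rng.choice(atoms, size=K, p=weights)
print(np.intersect1d(phi_1, phi_2).size)   # many shared atoms
```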

One appealing aspect of the model is its recursiveness: we could easily extend it to another level of hierarchy by drawing H from yet another DP. This can be used to learn topics jointly from multiple corpora. For example, documents from a Statistics corpus and a Computer Science corpus are likely to share a topic like 'Python'.

2 Stick Breaking Construction

How are \beta_k and \pi_{jk} related? Sethuraman's stick breaking construction gives the \beta_k's. The goal is a sequence of \beta_k's such that \sum_{k=1}^{\infty} \beta_k = 1. The construction proceeds by recursively breaking a stick of unit length; let us call this the parent stick.

\beta_k' \sim \text{Beta}(1, \gamma) \quad \quad \quad \quad \beta_k = \beta_k'\prod_{l=1}^{k-1}(1-\beta_l')

For our case with J documents (groups), we can think of recursively breaking J sticks of unit length such that \sum_{k=1}^{\infty} \pi_{jk} = 1 for every j. The catch is that the length of each piece of the j^{th} stick depends on the piece lengths of the parent stick, which is what gives the construction its hierarchical flavor.

\pi_{jk}' \sim \text{Beta}\left(\alpha_0\beta_k, \ \alpha_0\sum_{l=k+1}^{\infty}\beta_l\right) \quad \quad \quad \pi_{jk} = \pi_{jk}'\prod_{l=1}^{k-1}(1-\pi_{jl}')

This stick breaking process is valid for the \pi_{jk}'s because it can be shown that \pi_{j} \sim \text{DP}(\alpha_0, \beta). Note the contrast with an ordinary DP: there we would have drawn \pi_{jk}' from \text{Beta}(1, \alpha_0), which does not depend on the parent weights \beta and hence provides no sharing across groups.
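Below is a minimal sketch of this hierarchical stick breaking construction, truncated at K atoms. It assumes numpy; the truncation level, hyperparameter values, and variable names are illustrative rather than canonical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, J = 25, 2            # truncation level and number of documents (illustrative)
gamma, alpha_0 = 2.0, 1.0

# Parent stick: beta_k' ~ Beta(1, gamma), beta_k = beta_k' * prod_{l<k} (1 - beta_l')
beta_prime = rng.beta(1.0, gamma, size=K)
beta = beta_prime * np.concatenate(([1.0], np.cumprod(1.0 - beta_prime)[:-1]))

# Document sticks: pi_jk' ~ Beta(alpha_0 * beta_k, alpha_0 * sum_{l>k} beta_l)
tail = np.clip(1.0 - np.cumsum(beta), 1e-12, None)  # approximates sum_{l>k} beta_l under truncation
pi = np.zeros((J, K))
for j in range(J):
    pi_prime = rng.beta(alpha_0 * beta, alpha_0 * tail)
    pi[j] = pi_prime * np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))

print(beta.sum(), pi.sum(axis=1))  # all close to 1 for a large enough K
```

For a sufficiently large truncation level both sums are numerically close to 1, and every \pi_j is a set of weights over the same global atom indices, which is exactly the sharing the hierarchy was introduced to provide.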

3 Chinese Restaurant Franchise

Figure 1: Chinese Restaurant Franchise when J = 2

The CRP metaphor for the Dirichlet Process can be extended to the Chinese Restaurant Franchise (CRF) metaphor to explain the HDP intuitively. The core idea is that, in addition to customers going to restaurants and picking tables, the tables themselves take up a customer role and go to a parent restaurant. For simplicity, take the case of 2 restaurants, X and Y. A, B, C are tables in restaurant X, which has 4 customers; D and E are tables in restaurant Y, which has 5 customers. The parent restaurant has tables p, q, and r, whose customers are the 5 tables from the two restaurants. The seating arrangement is as shown in Figure 1. Since A and D sit at table p, the dish served at tables A and D is the same as the dish at p.

Suppose a new customer, customer 5, walks into restaurant X. She sits at table A with probability \frac{1}{4 + \alpha_0} and eats dish p (since A sits at p in the parent restaurant). Similarly, she sits at table B with probability \frac{1}{4 + \alpha_0} and eats dish q, and at table C with probability \frac{2}{4 + \alpha_0} and eats dish p. The numerator is the number of customers in restaurant X sitting at that particular table, and the denominator is \alpha_0 plus the number of customers in the restaurant before the current customer arrived. Customer 5 chooses to sit at a new table with probability \frac{\alpha_0}{4 + \alpha_0}. In that case, we can think of a waiter at the new table going to the parent restaurant, where a CRP starts with the waiter as a new customer. The probabilities are as shown in Table 1.

Table/Dish Probability
p \frac{2}{5 + \gamma}
q \frac{1}{5 + \gamma}
r \frac{2}{5 + \gamma}
New Dish \frac{\gamma}{5 + \gamma}
Table 1: CRP in parent restaurant

If the waiter sits at table p, she'd call restaurant X and tell the chef to serve dish p at the new table that customer 5 chose. If the waiter chooses a new dish, say s, then this new dish s will be served to customer 5.

The process is similar in restaurant Y. A new customer either sits at one of the existing tables or chooses a new table; if she prefers a new table, a waiter goes to the parent restaurant and chooses a dish according to another CRP. Notice that there is a single global set of dishes, and each restaurant serves a subset of those dishes.
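A minimal simulation of one step of the franchise, using the counts of Figure 1 for restaurant X, is sketched below. It assumes numpy; the hyperparameter values and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_0, gamma = 1.0, 0.5

# Restaurant X from Figure 1: tables A, B, C with 1, 1, 2 customers, serving dishes p, q, p.
table_counts = np.array([1, 1, 2])
table_dish = ["p", "q", "p"]
# Parent restaurant: dishes p, q, r are chosen by 2, 1, 2 tables respectively.
parent_counts = {"p": 2, "q": 1, "r": 2}

# Customer 5 joins an existing table (proportional to its occupancy) or a new one (proportional to alpha_0).
probs = np.append(table_counts, alpha_0) / (table_counts.sum() + alpha_0)
choice = rng.choice(len(probs), p=probs)

if choice < len(table_counts):
    dish = table_dish[choice]          # she eats whatever dish that table already serves
else:
    # New table: its waiter runs a CRP over dishes in the parent restaurant (Table 1).
    dishes = list(parent_counts) + ["new dish"]
    weights = np.array(list(parent_counts.values()) + [gamma])
    dish = rng.choice(dishes, p=weights / weights.sum())

print("customer 5 eats dish:", dish)
```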

4 Final Comments

The dishes p, q, r are analogous to the \phi_k's, i.e., the support of G_0. A, B, C, D, E are analogous to \theta_{11}, \theta_{12}, \theta_{13}, \theta_{21}, \theta_{22} respectively. Note that \theta_{11} and \theta_{21} take the same value (dish p); that is, the latent variables are shared across the two groups. Because of this sharing of latent variables, it makes sense to use the HDP, whose predictive process is the CRF, as a prior for jointly modeling multiple related datasets. The posterior predictive distribution pools information across the restaurants before assigning a table to a new customer in any one restaurant.

Acknowledgements

The article was prepared using the Distill Template.

I wrote this article as part of the course CS698X: Probabilistic Modeling and Inference by Prof. Piyush Rai.

Updates and Corrections

If you see mistakes or want to suggest changes, please contact me.