Mr.LDA: Scalable Topic Modeling Using Variational Inference in MapReduce

Introduction

Mr.LDA is an open-source package for flexible, scalable, multilingual topic modeling using variational inference in MapReduce.

Latent Dirichlet Allocation (LDA) and related topic modeling techniques are useful for exploring document collections. Because large datasets are increasingly prevalent, inference for LDA needs to scale. Unlike other techniques that use Gibbs sampling, Mr.LDA uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible: examples include informed priors to guide topic discovery and extracting topics from a multilingual corpus.
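To illustrate why variational inference fits a distributed environment, here is a minimal Python sketch of one iteration. It is not Mr.LDA's code (the package itself runs on MapReduce); the function names, the symmetric prior alpha, the smoothing constant, and the simple normalization used for the topic update are all assumptions made for illustration. The point is that each document's variational update depends only on that document and the current topics, so it can run in a mapper, while the reducer only sums the emitted (topic, word) statistics.

```python
import math
from collections import defaultdict


def digamma(x):
    """Digamma for x > 0, using the recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def map_document(word_counts, beta, alpha, iterations=20):
    """Hypothetical mapper: variational update for a single document.

    word_counts maps word id -> count, beta[k][w] is the current
    probability of word w under topic k, alpha is a symmetric prior.
    Returns the document's gamma and a list of ((topic, word), weight).
    """
    num_topics = len(beta)
    total = sum(word_counts.values())
    gamma = [alpha + total / num_topics] * num_topics
    phi = {}
    for _ in range(iterations):
        exp_elog_theta = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * num_topics
        for word, count in word_counts.items():
            # phi[word][k] is proportional to beta[k][word] * exp(digamma(gamma[k]))
            weights = [beta[k][word] * exp_elog_theta[k]
                       for k in range(num_topics)]
            norm = sum(weights)
            phi[word] = [w / norm for w in weights]
            for k in range(num_topics):
                new_gamma[k] += count * phi[word][k]
        gamma = new_gamma
    emitted = [((k, word), count * phi[word][k])
               for word, count in word_counts.items()
               for k in range(num_topics)]
    return gamma, emitted


def reduce_statistics(emitted_pairs):
    """Hypothetical reducer: sum the weights emitted for each (topic, word)."""
    totals = defaultdict(float)
    for key, value in emitted_pairs:
        totals[key] += value
    return totals


def update_beta(totals, num_topics, vocab_size, smoothing=0.01):
    """Assumed topic update: smooth and normalize the summed statistics."""
    beta = []
    for k in range(num_topics):
        row = [totals.get((k, w), 0.0) + smoothing for w in range(vocab_size)]
        norm = sum(row)
        beta.append([v / norm for v in row])
    return beta
```

In a real MapReduce job, map_document would be called once per input document across many machines, and reduce_statistics would be applied per key after the shuffle; a driver would then repeat the iteration with the updated topics until convergence.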

More details are described in our paper:

Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja. Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce. Proceedings of the 21st International World Wide Web Conference (WWW 2012), 2012, pages 879-888, Lyon, France. [slides]

Mr.LDA was developed in the context of our NSF-funded project on Cross-Language Bayesian Models for Web-Scale Text Analysis Using MapReduce (CCF-1018625).

Getting Started

For instructions on getting started, look at the readme.

Acknowledgments

This work has been supported by the US NSF under awards IIS-0916043 and CCF-1018625. Any opinions, findings, or conclusions are those of the researchers and do not necessarily reflect those of the sponsors.