Cross-Language Bayesian Models for Web-Scale Text Analysis Using MapReduce

Project funded by the National Science Foundation (CCF-1018625)
PI: Jimmy Lin, Co-PIs: Jordan Boyd-Graber, Philip Resnik
University of Maryland

Note: This project concluded in August 2014. This website is no longer actively maintained, and is available primarily for archival purposes.

Overview

The web promises unprecedented access to the perspectives of an enormous number of people on a wide range of issues. Turning that still untamed cacophony into meaningful insights requires dealing with the linguistic diversity and scale of the web. Most current research focuses on specialized tasks such as tracking consumer opinions. Such work frequently treats the web as both monolithic and monolingual, ignoring the variety of languages represented and the rich interplay between topics and issues under discussion.

Over the past few years, we have advanced the state of the art by focusing on three key challenges:

First, we have developed scalable algorithms for linguistic modeling within a Bayesian framework. This includes development of novel techniques for distributed variational inference (using MapReduce), adaptation of online variational inference to support changing vocabularies, and extension of online variational inference to adaptor grammars.
Second, we have built and validated novel Bayesian models that learn consistent interpretations of text across languages and a wide range of response variables of interest (for example, views on an issue, strength of emotion relative to an event, and focus of attention).
Finally, we have introduced a new interactive paradigm for topic modeling that allows users to provide feedback to improve model quality. Interactive topic models make our tools more accessible to users without a background in machine learning (for example, domain experts in biology and policy makers).

We have applied our Bayesian modeling approaches to a variety of domains and have built multi-disciplinary collaborations with many researchers. This includes the development of hierarchical clustering methods for developmental biology (with researchers at the Agricultural Research Service) and for modeling protein structure (with researchers at the National Institutes of Health). In addition, we have collaborated with social scientists to validate our models for detecting social influence. Our work on interactive topic modeling has attracted significant interest from humanists and social scientists.

Finally, we are committed to broader dissemination of our efforts through open-source software. One of the products from this project is Mr.LDA, an open-source toolkit for flexible, scalable, multilingual topic modeling using variational inference in MapReduce. In addition, our fast single-machine hierarchical topic modeling code has been incorporated into Mallet, the community-standard topic modeling package.

Project Team

	Jordan Boyd-Graber, Assistant Professor, Computer Science, University of Colorado
	Hal Daumé III, Associate Professor, Computer Science and UMIACS, University of Maryland
	Jimmy Lin, Associate Professor, The iSchool and UMIACS, University of Maryland
	Philip Resnik, Professor, Linguistics and UMIACS, University of Maryland
	Nima Asadi, Ph.D., Computer Science, University of Maryland (Graduated Summer 2013)
	Vlad Eidelman, Ph.D. student, Computer Science, University of Maryland (Graduated Winter 2013)
	Alan Du, High School student (summer intern)
	He He, Ph.D. student, Computer Science, University of Maryland
	Yuening Hu, Ph.D., Computer Science, University of Maryland (Graduated Summer 2014)
	Viet-An Nguyen, Ph.D. student, Computer Science, University of Maryland
	Brianna Satinoff, M.S., Computer Science, University of Maryland (Graduated Spring 2011)
	Ke Zhai, Ph.D., Computer Science, University of Maryland (Graduated Winter 2014)

Publications

Ke Zhai, Jordan Boyd-Graber, and Shay B. Cohen. Online Adaptor Grammars with Hybrid Inference. Transactions of the Association for Computational Linguistics, 2:465-476, 2014.

Thang Nguyen, Yuening Hu, and Jordan Boyd-Graber. Anchors Regularized: Adding Robustness and Extensibility to Scalable Topic-Modeling Algorithms. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014, pages 359-369, Baltimore, Maryland.

Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. Polylingual Tree-Based Topic Models for Translation Domain Adaptation. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014, pages 1166-1176, Baltimore, Maryland.

Mohit Iyyer, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. Political Ideology Detection Using Recursive Neural Networks. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014, pages 1113-1122, Baltimore, Maryland.

Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, Deborah Cai, Jennifer Midberry, and Yuanxin Wang. Modeling Topic Control to Detect Influence in Conversations using Nonparametric Topic Models. Machine Learning, 95(3):381-421, 2014.

Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. Interactive Topic Modeling. Machine Learning, 95(3):423-469, 2014.

Yuening Hu, Jordan Boyd-Graber, Hal Daumé III, and Z. Irene Ying. Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent. Advances in Neural Information Processing Systems 26, 2013, pages 1079-1087, Lake Tahoe, Nevada.

Viet-An Nguyen, Jordan L. Boyd-Graber, and Philip Resnik. Lexical and Hierarchical Topic Regression. Advances in Neural Information Processing Systems 26, 2013, pages 1106-1114, Lake Tahoe, Nevada.

Jordan Boyd-Graber, Kimberly Glasgow, and Jackie Sauter Zajac. Spoiler Alert: Machine Learning Approaches to Detect Social Media Posts with Revelatory Information. Proceedings of the 76th Annual Meeting of the American Society for Information Science and Technology (ASIST 2013), 2013, Montreal, Canada.

Ke Zhai and Jordan Boyd-Graber. Online Topic Models with Infinite Vocabulary. Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013, pages 561-569, Atlanta, Georgia.

Naho Orita, Rebecca McKeown, Naomi H. Feldman, Jeffrey Lidz, and Jordan Boyd-Graber. Discovering Pronoun Categories using Discourse Information. Proceedings of the 35th Annual Meeting of the Cognitive Science Society, 2013, Berlin, Germany.

Viet-An Nguyen, Jordan Boyd-Graber, and Stephen Altschul. Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space. Journal of Computational Biology, 20(1):1-18, 2013.

Viet-An Nguyen, Yuening Hu, Jordan Boyd-Graber, and Philip Resnik. Argviz: Interactive Visualization of Topic Dynamics in Multi-party Conversations. Proceedings of the NAACL HLT 2013 Demonstration Session, 2013, pages 36-39, Atlanta, Georgia.

Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja. Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce. Proceedings of the 21st International World Wide Web Conference (WWW 2012), 2012, pages 879-888, Lyon, France.

Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. Topic Models for Dynamic Translation Model Adaptation. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 2012, pages 115-119, Jeju, Republic of Korea.

Viet-An Nguyen, Jordan Boyd-Graber, and Philip Resnik. SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 2012, pages 78-87, Jeju, Republic of Korea.

Yuening Hu and Jordan Boyd-Graber. Efficient Tree-Based Topic Modeling. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 2012, pages 275-279, Jeju, Republic of Korea.

Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. Besting the Quiz Master: Crowdsourcing Incremental Classification Games. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), 2012, pages 1290-1301, Jeju, Republic of Korea.

Yuening Hu, Ke Zhai, Sinead Williamson, and Jordan Boyd-Graber. Modeling Images using Transformed Indian Buffet Processes. Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012, Edinburgh, Scotland.

Yuening Hu, Jordan Boyd-Graber, and Brianna Satinoff. Interactive Topic Modeling. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), 2011, pages 248-257, Portland, Oregon.

Software

Mr.LDA is an open-source toolkit developed as part of this project for flexible, scalable, multilingual topic modeling using variational inference in MapReduce.
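
To give a rough sense of how variational inference for a topic model can be split into map and reduce steps, here is a minimal single-process sketch in Python. It is not Mr.LDA's code or API: the function names (`digamma`, `expected_log_topics`, `map_document`, `reduce_counts`, `run_iteration`), the hyperparameter defaults, and the in-memory lists standing in for a Hadoop job are all assumptions made for illustration. Each mapper fits the variational parameters of one document and emits expected topic-word counts; the reducer sums those counts; the driver turns the sums into updated topic parameters.

```python
import math
from collections import defaultdict


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def expected_log_topics(lam):
    """E[log beta_kw] under q(beta_k) = Dirichlet(lam_k); shared by all mappers."""
    out = []
    for row in lam:
        norm = digamma(sum(row))
        out.append([digamma(v) - norm for v in row])
    return out


def map_document(doc, elog_beta, alpha, iters=20):
    """Mapper (hypothetical): doc maps word id -> count.

    Runs the per-document variational updates, then emits
    ((topic, word), expected count) pairs.
    """
    num_topics = len(elog_beta)
    total = sum(doc.values())
    gamma = [alpha + total / num_topics] * num_topics
    phi = {}
    for _ in range(iters):
        # The shared -digamma(sum(gamma)) term cancels when phi is normalized.
        elog_theta = [digamma(g) for g in gamma]
        new_gamma = [alpha] * num_topics
        for w, count in doc.items():
            logits = [elog_theta[k] + elog_beta[k][w] for k in range(num_topics)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            norm = sum(weights)
            phi[w] = [v / norm for v in weights]
            for k in range(num_topics):
                new_gamma[k] += count * phi[w][k]
        gamma = new_gamma
    for w, count in doc.items():
        for k in range(num_topics):
            yield (k, w), count * phi[w][k]


def reduce_counts(pairs):
    """Reducer: sum the expected counts emitted for each (topic, word) key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def run_iteration(corpus, lam, alpha=0.1, eta=0.01):
    """Driver: one map/reduce pass that returns updated topic parameters."""
    elog_beta = expected_log_topics(lam)
    emitted = (pair for doc in corpus for pair in map_document(doc, elog_beta, alpha))
    totals = reduce_counts(emitted)
    vocab_size = len(lam[0])
    return [[eta + totals.get((k, w), 0.0) for w in range(vocab_size)]
            for k in range(len(lam))]
```

Calling `run_iteration` repeatedly, feeding each result back in as `lam`, corresponds to running one job per iteration. Because each word's topic responsibilities sum to one, the updated parameters always add up to the corpus token count plus the prior mass, which makes a convenient sanity check.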

A fast implementation of single-machine hierarchical topic modeling, developed as part of this project, has been incorporated into Mallet.

Acknowledgments

This work is supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation.