Putting the Clouds in Context: Statistical Machine Translation with MapReduce

Funded by the National Science Foundation (IIS-0836560)
PI: Jimmy Lin, Co-PI: Philip Resnik

Note: This project concluded in June 2011. This website is no longer actively maintained, and is available primarily for archival purposes.

Project Overview

In October 2007, Google and IBM jointly announced the Academic Cloud Computing Initiative (ACCI), with the goal of helping both researchers and students address the challenges of "web-scale" computing. The initiative revolves around Google's MapReduce programming framework, which represents a proven approach to tackling data-intensive problems in a distributed manner. Six universities were involved in the collaboration at the outset: Carnegie Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley, the University of Maryland, and the University of Washington. See Google press release, IBM press release, and UMD press release.

As part of this initiative, IBM and Google have dedicated a large cluster of several hundred machines for use by faculty and students at the participating institutions. The cluster takes advantage of Hadoop, an open-source implementation of MapReduce in Java. By making these resources available, Google and IBM hope to encourage faculty to adopt cloud computing in their research and to integrate the technology into the curriculum. A few months later, the ACCI teamed up with the National Science Foundation to create the Cluster Exploratory (CLuE) initiative, whereby NSF would provide funding to support research on the ACCI infrastructure. This project was funded under that program.

In the context of this project, we have been exploring the intersection of large-scale text retrieval and statistical machine translation. One thread has been scaling up iterative machine learning algorithms to larger and larger datasets. Another thread has been the application of IR techniques to automatically extract bilingual training data.

Project Team

	Jimmy Lin, Associate Professor, The iSchool (College of Information Studies), University of Maryland
	Philip Resnik, Professor, Department of Linguistics, University of Maryland
	Chris Dyer, Ph.D. student, Department of Computer Science, University of Maryland (graduated Spring 2010)
	Tamer Elsayed, Ph.D. student, Department of Computer Science, University of Maryland (graduated Summer 2009)
	Ferhan Ture, Ph.D. student, Department of Computer Science, University of Maryland

Publications (selected)

Ferhan Ture, Tamer Elsayed, and Jimmy Lin. No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), July 2011, Beijing, China.
Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers, 2010.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 419-428, October 2008, Honolulu, Hawaii.
Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce. Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008, pages 199-207, June 2008, Columbus, Ohio.
Jimmy Lin. Exploring Large-Data Issues in the Curriculum: A Case Study with MapReduce. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008, pages 54-61, June 2008, Columbus, Ohio.
Tamer Elsayed, Jimmy Lin, and Douglas Oard. Pairwise Document Similarity in Large Collections with MapReduce. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), Companion Volume, pages 265-268, June 2008, Columbus, Ohio.

Broader Impacts

This grant supported the development of Cloud⁹, a Hadoop library for both research and teaching used at Maryland and elsewhere.
This grant supported the development of a textbook on MapReduce algorithm design, Data-Intensive Text Processing with MapReduce (Lin and Dyer, 2010).
This grant supported multiple iterations of a course on MapReduce and large-data processing at the University of Maryland.

Disclaimer

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Please contact the PI for additional information.