Ivory: A Hadoop Toolkit for Distributed Text Retrieval

Project funded by the National Science Foundation (IIS-0916043)
PI: Jimmy Lin, University of Maryland

This project began September 2009 and ended September 2013. This website exists for archival purposes only and is not being actively maintained.

Overview

Text retrieval is vital to modern information-based societies, and today's systems must handle petabytes of data. This scale overwhelms the capabilities of individual servers and requires the processing power of networked clusters. Yet academic information retrieval research is still dominated by single-core solutions.

The question we ask in this project is: What should information retrieval systems look like in the era of "big data"? In particular, this project takes advantage of the MapReduce programming framework, via the Hadoop open-source implementation. We explore the design space of efficient algorithms for indexing and retrieval, with the goal of understanding tradeoffs between:

Quality (How good are the search results?)
Time (How fast is the algorithm?)
Space (How much memory do data structures consume?)
Scalability (How much text can our algorithms be applied to?)

Project Team

	Jimmy Lin, Associate Professor, The iSchool
	K. Ashwin Kumar, Ph.D. student, Computer Science
	Nima Asadi, Ph.D., Computer Science (Graduated Summer 2013)
	Ferhan Ture, Ph.D., Computer Science (Graduated Summer 2013)
	Lidan Wang, Ph.D., Computer Science (Graduated Summer 2012)
	Tamer Elsayed, Ph.D., Computer Science (Graduated Summer 2009)

Publications (Selected)

Dissertations

Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation.
Ferhan Ture. Ph.D. Dissertation, University of Maryland, College Park, 2013.

Multi-Stage Search Architectures for Streaming Documents.
Nima Asadi. Ph.D. Dissertation, University of Maryland, College Park, 2013.

Learning to Efficiently Rank.
Lidan Wang. Ph.D. Dissertation, University of Maryland, College Park, 2012.

Journal Articles

Document Vector Representations for Feature Extraction in Multi-Stage Document Ranking.
Nima Asadi and Jimmy Lin. Information Retrieval, 16(6):747-768, 2013.

Conference Papers

Hone: "Scaling Down" Hadoop on Shared-Memory Systems.
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, and Jimmy Lin. Proceedings of the 39th International Conference on Very Large Data Bases (VLDB 2013), pages 1354-1357, August 2013, Trento, Italy.

Effectiveness/Efficiency Tradeoffs for Candidate Generation in Multi-Stage Retrieval Architectures.
Nima Asadi and Jimmy Lin. Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), pages 997-1000, July 2013, Dublin, Ireland.

Combining Statistical Translation Techniques for Cross-Language Information Retrieval.
Ferhan Ture, Jimmy Lin, and Douglas W. Oard. Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2685-2702, December 2012, Mumbai, India.

Fast Candidate Generation for Two-Phase Document Ranking: Postings List Intersection with Bloom Filters.
Nima Asadi and Jimmy Lin. Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM 2012), pages 2419-2422, October 2012, Maui, Hawaii.

Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling.
Ferhan Ture and Jimmy Lin. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2012), pages 626-630, June 2012, Montreal, Quebec, Canada.

When Close Enough Is Good Enough: Approximate Positional Indexes for Efficient Ranked Retrieval.
Tamer Elsayed, Jimmy Lin, and Don Metzler. Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM 2011), pages 1993-1996, October 2011, Glasgow, Scotland.

No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity.
Ferhan Ture, Tamer Elsayed, and Jimmy Lin. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 943-952, July 2011, Beijing, China.

A Cascade Ranking Model for Efficient Ranked Retrieval.
Lidan Wang, Jimmy Lin, and Donald Metzler. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 105-114, July 2011, Beijing, China.

Ranking Under Temporal Constraints.
Lidan Wang, Donald Metzler, and Jimmy Lin. Proceedings of the 19th International Conference on Information and Knowledge Management (CIKM 2010), pages 79-88, October 2010, Toronto, Canada.

Learning to Efficiently Rank.
Lidan Wang, Jimmy Lin, and Donald Metzler. Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), pages 138-145, July 2010, Geneva, Switzerland.

Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search.
Jimmy Lin, Donald Metzler, Tamer Elsayed, and Lidan Wang. Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), November 2009, Gaithersburg, Maryland.

Software

Ivory is an open-source Hadoop toolkit for web-scale information retrieval research supported in part by this project. It serves as an experimental platform for much of the research described here.

Acknowledgments

This work is supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation.