Funded by the National Science Foundation (IIS-1144034)
PI: Jimmy Lin

Note: This project concluded in August 2014. This website is no longer actively maintained, and is available primarily for archival purposes.

Project Overview

When it comes to search, users desire results that are not only good but also fast. That is, ranking algorithms should be both effective and efficient. These two desiderata, however, are often in tension. To obtain high-quality results, developers typically take advantage of machine learning techniques to rank documents based on a multitude of features (so-called "learning to rank"): matching term scores, phrase scores, static document features, and so on. In the web context, state-of-the-art rankers use hundreds of features or more. In general, the more features an algorithm considers, the better the quality overall. However, considering more features takes more time, particularly for certain types of features that are computationally expensive. Thus, we often observe an inverse relationship between quality and speed.

In the information retrieval community, explorations in effectiveness and efficiency have been largely disjoint. This is problematic in that a piecemeal approach may yield ranking models that are impractically slow on web-scale collections, or algorithmic optimizations that sacrifice quality to an unacceptable degree. The aim of our work is to develop an integrated framework for building search systems that are both effective and efficient. To this end, we have been exploring a research program, dubbed "learning to efficiently rank", that allows algorithm designers to capture, model, and reason about tradeoffs between effectiveness and efficiency in a unified machine-learning framework.

Our core idea is to consider the ranking problem as a "cascade", where ranking is broken into a finite number of distinct stages. Each stage considers successively richer and more complex features, but over successively smaller candidate document sets. The intuition is that although complex features are more time-consuming to compute, the additional overhead is offset by examining fewer documents. In other words, the cascade model views retrieval as a multi-stage progressive refinement problem, where each stage balances the cost of exploiting various features with the potential gain in terms of better results. We have explored this notion in the context of linear models and tree-based models.
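The cascade idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function names, the toy features, and the cutoffs are all hypothetical, standing in for learned stage-wise rankers.

```python
# A minimal sketch of the cascade idea (function names, features, and
# cutoffs are illustrative, not taken from the project's implementations).

def cascade_rank(candidates, stages):
    """Each stage is a (score_fn, keep_k) pair: score the surviving
    documents with that stage's features, then keep the top keep_k."""
    survivors = candidates
    for score_fn, keep_k in stages:
        survivors = sorted(survivors, key=score_fn, reverse=True)[:keep_k]
    return survivors

# Toy run: a cheap feature prunes aggressively; a richer (pretend
# expensive) feature is computed only for the few documents left.
docs = [{"id": i, "tf": i % 7, "bm25": (i * 3) % 11} for i in range(100)]
stages = [
    (lambda d: d["tf"], 20),             # cheap feature, large cut
    (lambda d: d["tf"] + d["bm25"], 5),  # richer features, few docs
]
top = cascade_rank(docs, stages)
print(len(top))  # 5
```

The key property is that the expensive second-stage scorer runs over only 20 documents rather than the full collection, which is how the cascade offsets per-document feature cost.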

In addition to the document ranking problem, we have also explored the design of search architectures. Instead of treating the search engine as a monolithic entity, we have experimented with a multi-stage architecture where retrieval is decomposed into a candidate generation stage, a feature extraction stage, and a reranking stage using machine-learned models.
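The decomposition described above can be made concrete with a toy end-to-end sketch. Everything here is hypothetical (the corpus, the hand-set linear weights, and all function names); it only illustrates how the three stages hand off to one another, not how Ivory implements them.

```python
# A hypothetical sketch of the three-stage architecture: candidate
# generation, feature extraction, and reranking as separate stages.

def generate_candidates(query_terms, inverted_index, k):
    """Stage 1: cheap candidate generation via term matching."""
    hits = {}
    for term in query_terms:
        for doc_id in inverted_index.get(term, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)[:k]

def extract_features(doc_id, query_terms, docs):
    """Stage 2: richer features, computed only for surviving candidates."""
    tokens = docs[doc_id].split()
    overlap = sum(tokens.count(t) for t in query_terms)
    return [overlap, len(tokens)]

def rerank(candidates, features, weights):
    """Stage 3: score candidates with a (here, hand-set) linear model."""
    score = lambda d: sum(w * f for w, f in zip(weights, features[d]))
    return sorted(candidates, key=score, reverse=True)

docs = {1: "the cat sat", 2: "cat cat dog", 3: "dog runs"}
inverted_index = {"the": [1], "cat": [1, 2], "sat": [1], "dog": [2, 3], "runs": [3]}

query = ["cat"]
candidates = generate_candidates(query, inverted_index, k=10)
features = {d: extract_features(d, query, docs) for d in candidates}
ranked = rerank(candidates, features, weights=[1.0, -0.1])
print(ranked)  # doc 2 matches "cat" twice, so it outranks doc 1
```

Keeping the stages separate means each can be optimized or swapped independently, e.g., replacing the hand-set weights with a machine-learned model without touching candidate generation.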

Our findings are detailed in the publications listed below.

Project Team

Jimmy Lin
Associate Professor
The iSchool (College of Information Studies), University of Maryland

Hua He
Ph.D. student
Department of Computer Science, University of Maryland

Nima Asadi, Ph.D., Computer Science
Graduated Summer 2013
Dissertation title: Multi-Stage Search Architectures for Streaming Documents

Lidan Wang, Ph.D., Computer Science
Graduated Summer 2012
Dissertation title: Learning to Efficiently Rank

Relevant Publications

Broader Impacts

  • The Ivory toolkit, which contains implementations described in many of the papers above, is available as open-source software and can be downloaded here.
  • OptTrees, an implementation of architecture-conscious regression trees for learning to rank, is available as open-source software and can be downloaded here.

Disclaimer

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation. Please contact the PI for additional information.