Funded by the National Science Foundation (IIS-1144034)
PI: Jimmy Lin

Note: This project concluded in August 2014. This website is no longer actively maintained, and is available primarily for archival purposes.

Project Overview

When it comes to search, users desire results that are not only good but also fast. That is, ranking algorithms should be both effective and efficient. These two desiderata, however, are often in tension. To obtain high-quality results, developers typically take advantage of machine learning techniques to rank documents based on a multitude of features (so-called "learning to rank"): matching term scores, phrase scores, static document features, and so on. In the web context, state-of-the-art rankers use hundreds of features or more. In general, the more features an algorithm considers, the better the quality overall. However, considering more features takes more time, particularly for certain types of features that are computationally expensive. Thus, we often observe an inverse relationship between quality and speed.

In the information retrieval community, explorations in effectiveness and efficiency have been largely disjoint. This is problematic in that a piecemeal approach may yield ranking models that are impractically slow on web-scale collections, or algorithmic optimizations that sacrifice quality to an unacceptable degree. The aim of our work is to develop an integrated framework for building search systems that are both effective and efficient. To this end, we have been exploring a research program, dubbed "learning to efficiently rank", that allows algorithm designers to capture, model, and reason about tradeoffs between effectiveness and efficiency in a unified machine-learning framework.

Our core idea is to consider the ranking problem as a "cascade", where ranking is broken into a finite number of distinct stages. Each stage considers successively richer and more complex features, but over successively smaller candidate document sets. The intuition is that although complex features are more time-consuming to compute, the additional overhead is offset by examining fewer documents. In other words, the cascade model views retrieval as a multi-stage progressive refinement problem, where each stage balances the cost of exploiting various features with the potential gain in terms of better results. We have explored this notion in the context of linear models and tree-based models.
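The cascade idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function names, the toy features, and the cutoffs are all hypothetical, standing in for learned stage-wise rankers.

```python
# A minimal sketch of the cascade idea (function names, features, and
# cutoffs are illustrative, not taken from the project's implementations).

def cascade_rank(candidates, stages):
    """Each stage is a (score_fn, keep_k) pair: score the surviving
    documents with that stage's features, then keep the top keep_k."""
    survivors = candidates
    for score_fn, keep_k in stages:
        survivors = sorted(survivors, key=score_fn, reverse=True)[:keep_k]
    return survivors

# Toy run: a cheap feature prunes aggressively; a richer (pretend
# expensive) feature is computed only for the few documents left.
docs = [{"id": i, "tf": i % 7, "bm25": (i * 3) % 11} for i in range(100)]
stages = [
    (lambda d: d["tf"], 20),             # cheap feature, large cut
    (lambda d: d["tf"] + d["bm25"], 5),  # richer features, few docs
]
top = cascade_rank(docs, stages)
print(len(top))  # 5
```

The key property is that the expensive second-stage scorer runs over only 20 documents rather than the full collection, which is how the cascade offsets per-document feature cost.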

In addition to the document ranking problem, we have also explored the design of search architectures. Instead of treating the search engine as a monolithic entity, we have experimented with a multi-stage architecture where retrieval is decomposed into a candidate generation stage, a feature extraction stage, and a reranking stage using machine-learned models.
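The decomposition described above can be made concrete with a toy end-to-end sketch. Everything here is hypothetical (the corpus, the hand-set linear weights, and all function names); it only illustrates how the three stages hand off to one another, not how Ivory implements them.

```python
# A hypothetical sketch of the three-stage architecture: candidate
# generation, feature extraction, and reranking as separate stages.

def generate_candidates(query_terms, inverted_index, k):
    """Stage 1: cheap candidate generation via term matching."""
    hits = {}
    for term in query_terms:
        for doc_id in inverted_index.get(term, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)[:k]

def extract_features(doc_id, query_terms, docs):
    """Stage 2: richer features, computed only for surviving candidates."""
    tokens = docs[doc_id].split()
    overlap = sum(tokens.count(t) for t in query_terms)
    return [overlap, len(tokens)]

def rerank(candidates, features, weights):
    """Stage 3: score candidates with a (here, hand-set) linear model."""
    score = lambda d: sum(w * f for w, f in zip(weights, features[d]))
    return sorted(candidates, key=score, reverse=True)

docs = {1: "the cat sat", 2: "cat cat dog", 3: "dog runs"}
inverted_index = {"the": [1], "cat": [1, 2], "sat": [1], "dog": [2, 3], "runs": [3]}

query = ["cat"]
candidates = generate_candidates(query, inverted_index, k=10)
features = {d: extract_features(d, query, docs) for d in candidates}
ranked = rerank(candidates, features, weights=[1.0, -0.1])
print(ranked)  # doc 2 matches "cat" twice, so it outranks doc 1
```

Keeping the stages separate means each can be optimized or swapped independently, e.g., replacing the hand-set weights with a machine-learned model without touching candidate generation.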

Our findings are detailed in the publications listed below.

Project Team

Jimmy Lin
Associate Professor
The iSchool (College of Information Studies), University of Maryland

Hua He
Ph.D. student
Department of Computer Science, University of Maryland

Nima Asadi, Ph.D., Computer Science
Graduated Summer 2013
Dissertation title: Multi-Stage Search Architectures for Streaming Documents

Lidan Wang, Ph.D., Computer Science
Graduated Summer 2012
Dissertation title: Learning to Efficiently Rank

Relevant Publications

Broader Impacts

  • The Ivory toolkit, which contains implementations described in many of the papers above, is available as open-source software and can be downloaded here.
  • OptTrees, an implementation of architecture-conscious regression trees for learning to rank, is available as open-source software and can be downloaded here.

Disclaimer

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation. Please contact the PI for additional information.