Providing Relevant and Timely Results: Real-Time Search Architectures and Relevance Algorithms

Project funded by the National Science Foundation (IIS-1218043)
PI: Jimmy Lin, (formerly) University of Maryland

This project began in September 2012 and ended in September 2016. This website exists for archival purposes only and is not being actively maintained.

Overview

Search remains one of the best solutions today for satisfying users' information needs. However, we are inundated with ever-increasing quantities of information, with no relief in sight. As data volume on different media grows, so does the velocity: the rate at which information is generated, transmitted, and consumed. The growing importance of social media such as Twitter further exacerbates this problem. Better algorithms and systems for managing real-time document streams are clearly needed, including both retrospective techniques that handle content that has already accumulated and prospective techniques that anticipate future content in a proactive manner.

This project advanced the state of the art in information retrieval by tackling real-time search, and more broadly, addressing retrieval challenges associated with streams of documents and other types of dynamic document collections. From the perspective of intellectual merit, this project has made three main contributions:

  1. The development of high-performance search architectures for low-latency, high-throughput indexing and query evaluation, along with associated storage infrastructure for timestamped document collections.
  2. Methods for extracting temporal signals from streams of documents and temporally-focused ranking algorithms.
  3. The development of a task model, algorithms, and an evaluation framework for push notifications, where systems proactively monitor document streams (e.g., social media posts) to identify and deliver those that are of interest to the user.

One important aspect of this project was close coordination with evaluation efforts at the Text Retrieval Conferences (TRECs) sponsored by the U.S. National Institute of Standards and Technology (NIST). Each year, TREC attracts dozens of participants from around the world to work on shared tasks that jointly define the future direction of information retrieval research. This project has developed task models and evaluation methodologies for the TREC Microblog and Real-Time Summarization Tracks, including an innovative "Living Labs" evaluation framework for prospective information needs that takes advantage of live users to assess push notification systems. These efforts have had broader impact in helping to steer the overall research direction of the field. Another significant broader impact of this project is the uptake of research results by industry. As a specific example, Twitter's real-time recommendation system GraphJet, which was deployed in 2014, makes use of results from this project involving memory allocation models for index structures.

Overall, this successful project has contributed much to real-time information access, from the perspective of both effectiveness (systems that deliver high-quality results) and efficiency (systems that deliver results with low latency).

Project Team

Jimmy Lin
(formerly) Associate Professor, The iSchool
Yulu Wang
Ph.D. student, Computer Science, University of Maryland
Hua He
Ph.D. student, Computer Science, University of Maryland
Jinfeng Rao
Ph.D. student, Computer Science, University of Maryland
Nima Asadi
Ph.D., Computer Science, University of Maryland
(Graduated Summer 2013)

Publications (Selected)

Yulu Wang and Jimmy Lin. Partitioning and Segment Organization Strategies for Real-Time Selective Search on Document Streams. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM 2017), February 2017, Cambridge, United Kingdom.

Jinfeng Rao, Hua He, and Jimmy Lin. Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks. Proceedings of the 25th International Conference on Information and Knowledge Management (CIKM 2016), pages 1913-1916, October 2016, Indianapolis, Indiana.

Hua He and Jimmy Lin. Pairwise Word Interaction Modeling with Neural Networks for Semantic Similarity Measurement. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2016), pages 937-948, June 2016, San Diego, California.

Yulu Wang and Jimmy Lin. The Feasibility of Brute Force Scans for Real-Time Tweet Search. Proceedings of the ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), pages 321-324, September 2015, Northampton, Massachusetts.

Hua He, Kevin Gimpel, and Jimmy Lin. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1576-1586, September 2015, Lisbon, Portugal.

Jiaul H. Paik and Jimmy Lin. Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets? Proceedings of the SIGIR 2015 Workshop on Temporal, Social and Spatially-Aware Information Access, August 2015, Santiago, Chile.

Yulu Wang, Garrick Sherman, Jimmy Lin, and Miles Efron. Assessor Differences and User Preferences in Tweet Timeline Generation. Proceedings of the 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015), pages 615-624, August 2015, Santiago, Chile.

Jinfeng Rao, Jimmy Lin, and Miles Efron. Reproducible Experiments on Lexical and Temporal Feedback for Tweet Search. Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), pages 755-767, March 2015, Vienna, Austria.

Jimmy Lin, Miles Efron, Yulu Wang, and Garrick Sherman. Overview of the TREC-2014 Microblog Track. Proceedings of the Twenty-Third Text REtrieval Conference (TREC 2014), November 2014, Gaithersburg, Maryland.

Miles Efron, Jimmy Lin, Jiyin He, and Arjen de Vries. Temporal Feedback for Tweet Search with Non-Parametric Density Estimation. Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), pages 33-42, July 2014, Gold Coast, Australia.

Ellen M. Voorhees, Jimmy Lin, and Miles Efron. On Run Diversity in "Evaluation as a Service". Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), pages 959-962, July 2014, Gold Coast, Australia.

Yulu Wang and Jimmy Lin. The Impact of Future Term Statistics in Real-Time Tweet Search. Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), pages 567-572, April 2014, Amsterdam, The Netherlands.

Jimmy Lin, Milad Gholami, and Jinfeng Rao. Infrastructure for Supporting Exploration and Discovery in Web Archives. Proceedings of the 23rd International World Wide Web Conference Companion (WWW 2014), pages 851-855, April 2014, Seoul, South Korea. (Temporal Web Analytics Workshop 2014)

Jimmy Lin and Miles Efron. Infrastructure Support for Evaluation as a Service. Proceedings of the 23rd International World Wide Web Conference Companion (WWW 2014), pages 79-83, April 2014, Seoul, South Korea.

Nima Asadi and Jimmy Lin. An Exploration of Postings List Contiguity in Main-Memory Incremental Indexing. Proceedings of the WSDM 2014 Workshop on Large-Scale and Distributed Systems for Information Retrieval, February 2014, New York, New York.

Jimmy Lin and Miles Efron. Evaluation as a Service for Information Retrieval. SIGIR Forum, 47(2):8-14, 2013.

Jimmy Lin and Miles Efron. Overview of the TREC-2013 Microblog Track. Proceedings of the Twenty-Second Text REtrieval Conference (TREC 2013), November 2013, Gaithersburg, Maryland.

Nima Asadi. Multi-Stage Search Architectures for Streaming Documents. Ph.D. dissertation, University of Maryland, College Park, 2013.

Nima Asadi and Jimmy Lin. Fast Candidate Generation for Real-Time Tweet Search with Bloom Filter Chains. ACM Transactions on Information Systems, 31(3), article 13, 2013.

Jimmy Lin and Miles Efron. Temporal Relevance Profiles for Tweet Search. Proceedings of the SIGIR 2013 Workshop on Time-Aware Information Access, August 2013, Dublin, Ireland.

Nima Asadi, Jimmy Lin, and Michael Busch. Dynamic Memory Allocation Policies for Postings in Real-Time Twitter Search. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2013), pages 1186-1194, August 2013, Chicago, Illinois.

Software

Twitter Tools, the software infrastructure for the "track as a service" model implemented in the TREC 2013 and 2014 Microblog tracks, is available here under an open-source license.

Acknowledgments

This work is supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation.