Introduction to "Cloud Computing" (Fall 2008)

Project ES2: Mining and Analytics for IBM Intranet Search

Shivakumar Vaithyanathan
IBM Almaden Research Center

9:30am, October 22, 2008
Hornbake 2119

[Slides in PDF]

Abstract

The IBM Intranet consists of more than 100M pages and serves a user community of 350,000 people distributed across multiple continents. The goal of Project ES2 is to build a scalable and easily maintainable search engine for this intranet. To tackle this search problem, ES2 combines sophisticated offline analytics with intelligent runtime query matching. In the offline phase, pages crawled from the intranet are pushed through a multi-stage pipeline consisting of "local" (page-at-a-time) analysis and "global" (cross-page) analysis, followed by intelligent tokenization and indexing. In addition, the pages are continuously mined to discover new patterns for local and global analysis. In this talk, I will describe the overall ES2 flow, providing examples of local and global analysis and the associated mining tasks. ES2 was originally deployed as a single-machine solution but, to meet growing demand, is currently being migrated to a 30-node Hadoop cluster. This migration raises several challenging research issues, both in translating the overall flow onto Hadoop and in mapping the analysis and mining algorithms onto the MapReduce paradigm. I will illustrate these challenges by walking through the mapping of a few example algorithms.
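As a rough illustration of how the "local" and "global" stages described above might line up with MapReduce, the sketch below treats local analysis as a map step that sees one page at a time, and global analysis as a reduce step that sees every page sharing a key. This is not the ES2 implementation. The page record, the function names (`local_analysis`, `shuffle`, `global_analysis`, `run_pipeline`), and the toy task of grouping pages by normalized title are all assumptions made for illustration, and the grouping step only imitates in memory what a framework such as Hadoop would do across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical page record: (url, title). Not an ES2 data structure.
Page = Tuple[str, str]


def local_analysis(page: Page) -> Iterator[Tuple[str, str]]:
    """Map step: inspect one page in isolation and emit (key, value) pairs.

    The "analysis" here is only title normalization, a placeholder for
    whatever per-page extraction a real system would perform.
    """
    url, title = page
    normalized = " ".join(title.lower().split())
    if normalized:
        yield normalized, url


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, imitating the framework's shuffle between phases."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def global_analysis(key: str, urls: List[str]) -> Tuple[str, List[str]]:
    """Reduce step: see all pages sharing a key and produce a cross-page result."""
    return key, sorted(set(urls))


def run_pipeline(pages: Iterable[Page]) -> Dict[str, List[str]]:
    """Run map, shuffle, and reduce in sequence on an in-memory collection."""
    mapped = (pair for page in pages for pair in local_analysis(page))
    grouped = shuffle(mapped)
    return dict(global_analysis(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample_pages = [
        ("http://example.com/a", "Travel  Policy"),
        ("http://example.com/b", "travel policy"),
        ("http://example.com/c", "Benefits"),
    ]
    for title, urls in sorted(run_pipeline(sample_pages).items()):
        print(title, urls)
```

The point of the split is that `local_analysis` never needs more than one page, so it can run in parallel on any node, while `global_analysis` depends on the grouping step to bring related pages together first.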

About the Speaker

Shivakumar Vaithyanathan is the Sr. Manager at IBM Almaden responsible for Analytics, Search and Information Integration Research. He is an Associate Editor for the Journal of Statistical Analysis and Data Mining.


Sponsored by Amazon Web Services. Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States.
This page, first created: 16 Oct 2008; last updated: