Research in Information Retrieval


"The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST) and U.S. Department of Defense, was started in 1992 as part of the TIPSTER Text program. Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. In particular, the TREC workshop series has the following goals:

  • To encourage research in information retrieval based on large test collections.
  • To increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas.
  • To speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems.
  • To increase the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems"
-Selection from TREC Overview

Summer 2009

In the spring of my freshman year, one of my professors advised me to get involved with some research. Not knowing much about research at Delaware, I emailed my advisor asking what was being done in the department. Serendipitously, my advisor was looking for an undergraduate research assistant to help him with his current work in Information Retrieval. I began working for him that summer.

My initial job was fairly straightforward: I was to maintain the web interface for the Million Query Track.

The Million Query track's goal is to study evaluation and optimization of information retrieval systems over very many very incompletely judged topics organized into predefined categories.

The web interface was a customized Drupal module that ran to several thousand lines of PHP and several hundred lines of JavaScript. It enabled test subjects to quickly process textual documents and determine whether they were relevant to a given query. Professor Carterette had only recently come to Delaware and had brought this system with him; because of this, and because Drupal had just been upgraded to a newer version, it took a while to get the web interface back up and running.

When we did get it working, it came time to load documents into it for processing. As you might anticipate with a name involving "Million", there were about a million documents that needed to be loaded into the database. Of course, the actual documents we had available to sort through numbered in the billions, so loading was non-trivial. The documents were broken up into fewer than a hundred archive files, each of which was heavily compressed. Our first attempts at loading were projected to take a little under a year; by indexing the files for quicker access, we were able to bring that down to a few weeks. Finally, by exploiting the parallelized nature of our server, we could further reduce that to only a few days. This was deemed acceptable, and a few days later we had successfully loaded all the documents.
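The two speedups described above, indexing the archives and then working on several archives at once, can be illustrated with a small Python sketch. This is not the code we actually ran: the archive layout (one "doc_id<TAB>text" record per line, gzip-compressed), the function names, and the toy data are all assumptions made up for the example, and threads stand in for whatever parallelism the server really offered.

```python
import gzip
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_archive(docs):
    """Build a toy gzip archive: one 'doc_id<TAB>text' record per line (assumed format)."""
    lines = "".join(f"{doc_id}\t{text}\n" for doc_id, text in docs.items())
    return gzip.compress(lines.encode("utf-8"))


def iter_records(blob):
    """Yield (doc_id, text) pairs from one compressed archive."""
    for line in gzip.decompress(blob).decode("utf-8").splitlines():
        doc_id, text = line.split("\t", 1)
        yield doc_id, text


def build_index(archives):
    """One pass over every archive: map doc_id -> name of the archive holding it."""
    return {doc_id: name
            for name, blob in archives.items()
            for doc_id, _ in iter_records(blob)}


def load_from_archive(blob, wanted_ids):
    """Decompress one archive once and keep only the wanted documents."""
    return {doc_id: text for doc_id, text in iter_records(blob)
            if doc_id in wanted_ids}


def load_documents(archives, index, wanted_ids, workers=4):
    """Group wanted IDs by archive, then process those archives in parallel."""
    by_archive = defaultdict(set)
    for doc_id in wanted_ids:
        by_archive[index[doc_id]].add(doc_id)
    loaded = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(load_from_archive, archives[name], ids)
                   for name, ids in by_archive.items()]
        for future in futures:
            loaded.update(future.result())
    return loaded


# Hypothetical toy data: two small archives, two wanted documents.
archives = {
    "part-00": make_archive({"d1": "alpha", "d2": "beta"}),
    "part-01": make_archive({"d3": "gamma", "d4": "delta"}),
}
index = build_index(archives)
docs = load_documents(archives, index, {"d1", "d4"})
print(sorted(docs.items()))  # [('d1', 'alpha'), ('d4', 'delta')]
```

The point of the index is that each archive is opened once for all of the documents wanted from it, rather than being searched again for every document; the thread pool then lets several archives be processed at the same time.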

Near the end of the summer, Professor Carterette decided to get us involved in another TREC track: Relevance Feedback.

The goal of the relevance feedback track is to provide a framework for exploring the effects of different factors on the success of relevance feedback.

This track required modifications to our web interface, since it tested a different set of ideas about Information Retrieval systems.

Fall 2009

Around this time, the summer was ending. Professor Carterette offered to let me do an Independent Study (CISC-366) with him in the fall so I could continue to be a part of the project. We spent most of the semester keeping the interface going while users processed documents. It was a very hectic time, since the interface developed a mysterious bug at the last second. Regardless, we managed to get enough data to make the Million Query track worth it.

Because of my assistance on the project, my professor invited me to the TREC conference itself. The conference was held at the National Institute of Standards and Technology (NIST) in Maryland. I attended for three days, met several people in the IR field, and listened in on several lectures and discussions about developments in Information Retrieval. It was a very exciting experience for an undergraduate.

Summer 2010

Professor Carterette asked me to return the next summer to continue working with him. This time, however, he moved me more toward the theoretical side of the process. I began by reading several papers on Sub-topic Retrieval, an area within Information Retrieval. Then, under Professor Carterette's direction, I got a chance to implement some systems based on that previous work and to expand on it further.