CSSE 413: Information Retrieval

Overview

The purpose of this assignment is to explore several information retrieval techniques. Please work on this assignment in pairs.

Corpus

The corpus to be processed consists of brief biographies of the presidents of the United States. They were downloaded from the following White House site.

Requirements

Implement the following algorithms and techniques:
  1. BM25 scoring function
  2. Co-occurring term extraction, such as skip bi-grams or n-grams
  3. Phrase extraction, such as passage term matching or textual alignment
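As a starting point for the first item, here is a minimal sketch of BM25 scoring. It is an illustration under stated assumptions, not a required design: the whitespace tokenizer, the function names (tokenize, bm25_scores, top_k), and the parameter defaults k1=1.5 and b=0.75 are all choices made for this example. The IDF uses the common "+1 inside the logarithm" variant, which keeps weights non-negative; other formulations of BM25 differ slightly.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real solution may
    # also strip punctuation, remove stop words, or stem.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs  # average doc length

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption, see lead-in).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=10):
    """Return (document index, score) pairs for the k best documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]
```

Calling top_k(query, docs) with the list of biography texts returns the ten highest-scoring document indices together with their scores, which is the form of result the assignment asks you to report.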

The use of n-grams is illustrated in the Watson article entitled "Identifying Implicit Relationships", p 12.2.
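For orientation, the following is a minimal sketch of contiguous n-gram extraction and counting. The function names (ngrams, top_ngrams) and the whitespace tokenization are assumptions made for this example and are not taken from the Watson article.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-grams in a text with their counts."""
    # Assumption: lowercase + whitespace split as the tokenizer.
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n)).most_common(k)
```

Frequent n-grams taken from the top-ranked documents are one simple way to surface terms that co-occur with the query terms.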

The use of skip bi-grams, passage scoring, and textual alignment is illustrated in the Watson article entitled "Textual Evidence Gathering and Analysis".

Skip-grams are discussed further in the article "A Closer Look at Skip-gram Modelling".
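To make the idea of skip bi-grams concrete, here is a minimal sketch that works on surface tokens only. It assumes a k-skip definition in which two tokens form a pair if they appear in order with at most max_skip tokens between them. The names skip_bigrams and skip_bigram_score, the default max_skip=2, and the overlap-ratio score are assumptions for this example; the Watson scorers are more elaborate and are not reproduced here.

```python
def skip_bigrams(tokens, max_skip=2):
    """Return the set of ordered token pairs with at most max_skip
    tokens between the two members (max_skip=0 gives plain bigrams)."""
    pairs = set()
    for i, left in enumerate(tokens):
        # j ranges over the next max_skip + 1 positions after i.
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.add((left, tokens[j]))
    return pairs


def skip_bigram_score(query, passage, max_skip=2):
    """Fraction of the query's skip bi-grams that also occur in the passage."""
    # Assumption: lowercase + whitespace split as the tokenizer.
    q = skip_bigrams(query.lower().split(), max_skip)
    p = skip_bigrams(passage.lower().split(), max_skip)
    return len(q & p) / len(q) if q else 0.0
```

A document could then be scored by, for example, the best skip_bigram_score over its sentences. Unlike plain bigrams, this rewards a passage such as "lincoln was the president" for the query "lincoln president" even though the two terms are not adjacent.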

Assignments

For each of the methods listed above, determine the top ten documents for the following queries. Please report the values returned by each of the methods, and include a brief subjective analysis of the strengths and weaknesses of each approach.

Please submit a brief write-up, entitled "Analysis.pdf", in which you document your results. Submit it, together with your code, to the appropriate drop-box on Moodle.