CSSE 413: Information Retrieval
Overview
The purpose of this assignment is to explore several information retrieval
techniques. Please work on this assignment in pairs.
Corpus
The corpus to be processed consists of brief biographies of the
presidents of our country. They were downloaded from the White House
site.
Requirements
Implement the following algorithms and techniques:
- BM25 scoring function
- Co-occurring term extraction such as skip bi-grams or n-grams
- Phrase extraction such as passage term matching or textual alignment
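As a starting point for the first item, here is a minimal sketch of BM25 scoring. It uses one common variant of the formula (with the IDF term smoothed by adding 1 inside the logarithm so it never goes negative). The function names `tokenize` and `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration, not requirements of this assignment.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document.

    query: list of query tokens
    docs:  list of documents, each a list of tokens
    k1, b: the usual BM25 free parameters (illustrative defaults)
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the +1 keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

To rank the corpus for a query, tokenize each biography, call `bm25_scores(tokenize(query), docs)`, and sort the document indices by descending score.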
The use of n-grams is illustrated in the Watson article entitled
Identifying Implicit Relationships, p 12.2
The use of skip bi-grams, passage scoring and textual alignment
are illustrated in the Watson article entitled Textual
Evidence Gathering and Analysis.
See also the article A Closer Look at Skip-gram Modelling.
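To make the two co-occurrence techniques concrete, the sketch below extracts n-grams and skip bi-grams from a list of surface tokens. It is only one simple interpretation: the function names `ngrams` and `skip_bigrams` and the `max_skip` parameter are assumptions for illustration, and the cited articles may define these units differently (for example, over parsed structure rather than raw token order).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Return ordered token pairs with at most max_skip tokens between them.

    With max_skip=0 this reduces to ordinary bi-grams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The partner may sit up to max_skip + 1 positions to the right.
        last = min(i + max_skip + 2, len(tokens))
        for j in range(i + 1, last):
            pairs.append((left, tokens[j]))
    return pairs
```

For example, with the tokens `["civil", "war", "president"]` and `max_skip=1`, the pair `("civil", "president")` is produced in addition to the two ordinary bi-grams, which lets a query match text where the terms are near each other but not adjacent.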
Assignments
For each of the methods listed above, determine the top ten documents
for the following queries. Please indicate the values returned by each
of the methods as well as a brief subjective analysis of the strengths
and weaknesses of each of the approaches.
- adams
- lincoln
- president
- assassinated president
- great president
- first president
- civil war president
- Ten queries of your own choosing, highlighting the benefits and
drawbacks of the methods listed above.
Please submit a brief write-up, entitled "Analysis.pdf", in which
you document your results, together with your code, to the appropriate
drop-box on Moodle.