Markov Chain Programming Assignment Overview

Markov documents: index overview markov justification turnin-instructions code examples partners importing classes MultiSet Documentation

Assignment Constraints

You may talk with others about the ideas in this assignment, but you must write all code with your partner. I recommend that you not split up the work, but do it together with the weaker programmer as "driver" most of the time, and with the driver never being willing to do things without first understanding them. You may (and are encouraged to) talk to other students and get help with debugging, but do not let other students write the code for you.

Objectives

You should

Gain experience with reading specifications of data structures and using them without being concerned with how they are implemented
Become familiar with the JDK's Collections Framework included in the java.util package
Understand the Markov Chain algorithm for constrained random sequence generation
Experience non-trivial interaction between the data structures and the algorithm in the text-justification portion of the assignment

Your Task

Your task is to implement two different algorithms, which will be performed in sequence.

The first is the Markov Chain algorithm. (Follow the link for an in-depth description.) It takes as input some sequence of "words" and generates as output a constrained random sequence composed from some or all of the original "words". The new sequence has similar statistical properties to the original sequence.

The second algorithm is a Text Justification algorithm. (Follow the link for an in-depth description.) This algorithm should take as input some intermediate data representation from the Markov Chain algorithm and print that data to the console so that it is both left and right justified.

General Program Specification

You are to write a Java application in a class named Markov. The Markov constructor should take 4 parameters (in the following order):

The name of the text file that will be used to generate the random text.
The prefix length for the Markov Chains (n).
The maximum number of words that the output should contain.
The maximum number of characters per output line. This is the justification width.

For example, the command

Markov markov = new Markov("texts\\Oz.txt", 3, 400, 75);

generates random text (written to System.out) based on the contents of the file Oz.txt (from the texts folder) with Markov Chains of length 3. It generates text until the Markov Chain algorithm naturally terminates or until 400 words are generated, whichever comes first. It justifies the output so that each output line occupies exactly 75 characters (except the last line).

Note that the above example implies that the Markov class is in the default package, and that the Oz.txt file is in the same folder as Markov.class.

A "word" is defined for the purpose of this algorithm as any sequence of non-whitespace characters, so you will not need to treat punctuation or capitalization in any special way. Thus the following will be considered to be four different words:

book
Book
"Book
book,

Data Structures Specification

You are required to use the data structures specified in this document for pedagogical reasons. Later in this course and in life you will have a chance to choose the data structures you think best fit the problem or implement your own, but at this point it is important to make sure you can figure out how to use those specified here.

Note that this section will make more sense after you have read the in-depth descriptions of the Markov Chain and Text Justification algorithms. You should skim this document now, and read it again more carefully after reading those documents

Current Prefix. In your implementation of the class that represents the current prefix, use a FixedLengthQueue object as an instance variable that holds the actual words in the prefix. A FixedLengthQueue should have a constructor that takes a single argument–an integer that specifies the length of the queue. The class needs to support a method add(). This method should take a String as its sole argument and add it to the end (rear) of the queue. If the queue is full, that is, if it has reached its specified length, the add() method should also remove the String from the beginning (front) of the queue in order to maintain the length. Finally the queue should have a toString() method that returns a single String containing each entry in the queue, from front to back, separated by a space. Whenever you move on to the next prefix during the generation of the table of prefixes and their suffixes, you can simply add the new word to the FixedLengthQueue to obtain the next prefix.
Table and Table Entries. For the table of prefixes and their associated suffixes, use a java.util.HashMap object. You must be sure that the keys (which should be Markov prefixes) have a sensible public int hashCode() method. On pp. 232-236, Weiss has a section entitled "Implementing equals and hashCode". The hashCode() example in that section should be useful to you. Each entry in the table (HashMap) will have a key that is a prefix and an associated value that IS-A (or possibly HAS-A depending on what you want to do) MultiSet that contains the suffixes that go with the given prefix.
Suffix Collections and Suffixes. For each prefix, you will build the collection of its suffixes. This collection should be (or have) a MultiSet, a class that we provide. Adding these files to the project is described in the importingEclipseClasses.htm document. You should not upload these files to AFS as part of your turnin; our test script will copy them into your turnin folder before it compiles your code. Each item in a MultiSet will represent one suffix that corresponds to the given prefix. This should allow fast insertion of new items and fast search for the K^th item when the random number generator tells you to use the K^th suffix. A trivial sample program that uses a MultiSet is included in examples.html.
Text Justification. Use the java.util.LinkedList class along with an array or ArrayList as described in the Text Justification algorithm description.

Efficiency and Memory Use

If the number of output words is small compared to the text from which the output is constructed, we would expect that most of the program's time would be spent building the table of prefixes and their suffixes, and not much time producing the output.

It is possible that the default amount of memory allocated by the Java interpreter will not be enough. To increase to 100 (or substitute your own number) megabytes, use the -Xmx option when you invoke the interpreter: "java -Xmx100m Driver 0". This -Xmx100m option will also be used when we test your program. To enter this option in Eclipse, choose Run As→Run... , click on Arguments. Under VM arguments, enter -Xmx100m.

Sample Data Files

Before you begin testing your program, you should create your own small data file, designed to produce a relatively small data structure that still provides choices for the random sequence generator. The words in the file do not have to make sense. Once you are confident that your program is working, you will want to try it on larger files that will produce more interesting random text.

Some larger texts on which you may test your program are collected in the texts folder. Due to memory requirements, you may not be able to process the longer ones with prefix-lengths greater than two, but a prefix length of two gives fairly interesting results. Try some of the smaller files with larger prefix lengths.

If you have a text that produces particularly interesting random sentences, please send it to me, and I make it available to other students.

Milestones

This assignment will be graded twice. For the first milestone, your code must generate appropriate Markov chained output, but that output does not have to be justified. Grading for this part will treat any amount of white space in the output as a delimiter. The Markov constructor should take the four arguments specified above, but for now it can ignore the last argument, justification width.

The second milestone is the completed program, Markov plus justification.

Running Your Program for Grading Purposes

We will run a set of JUnit tests similar to those you were given.

Questions and Answers

These questions and answers will probably not make much sense if you have not yet read the Markov Chain and Text Justification algorithm descriptions. I have given a variation on this assignment before, and these questions come from my archive of email and newsgroup questions from that term.

Questions	Answers
How can we choose a random suffix from a `MultiSet`?	Generate a random number, k, between 0 and size(), then use the `findKth` method from the `MultiSet` class.
Can we assume that the length of the line being outputted is longer than the length of the prefix?	I presume you mean "the first prefix". Since that prefix contains only one word, the answer is yes. You may assume that no input word is longer than the line length.
Is it OK to not align the right side of last line of output with the other lines or is this mandatory?	It would look pretty ugly if you did! The last line should be left-justified, of course, but definitely not right-justified (unless that happens naturally without adding any extra spaces).
Also, will the text file name given at the command prompt include the ".txt"?	Yes. Whatever the full filename is, it should be included in the call to the constructor. If it's in a different directory than `Markov.class`, a pathname must be given, of course.
Do you really mean that there should be two spaces after every period, question mark, etc?	No. Only if the period, question mark, or exclamation point ends a word (defined for this assignment as a sequence of non-whitespace characters).
How do I know whether a character is a whitespace character?	One approach is to use the `Character.isWhiteSpace` method. It would be very nice to set up a StringTokenizer object that considers all whitespace characters as delimiters. This would be difficult if you wanted all Unicode whitespace characters; perhaps it is easier if you restrict your attention to ASCII whitespace characters, which you are allowed to do. Also see examples.html for an explanation of a `String`'s `split` method.
Are we ever supposed to call the `hashCode()` method that we must write?	No. The `HashMap` methods will call it when needed.
Can work on the justification part before the Markov algorithm is working?	Definitely. I suggest working on the two parts "in parallel". If you are working on the justification part, you can write a very simple `main()` method to test it. All that this program must do is (like Markov) feed your justification routine a word at a time, so that the output routine can justify and print the words from a line whenever that line gets filled. Your temporary program can simply bypass the Markov Chain part and justify the original input text.
What's the worst behavior my program can exhibit (this question came from me, not from a student)?	An infinite loop. You should make sure that your program always terminates and gives some kind of output to show what it is doing, even if that output is not correct.
What is the point of Markov, besides the DataStructures practice? Most of our programs up to this point could have some use in the real world. I am just not sure what, if anything, Markov can be used for.	A very good question. Let me give you an analogy. At one point, I started taking piano lessons. Something I had always planned to do "when life slows down a bit." (I finally decided that it never will, and that if I don't do it now, I'll never get around to it). Then life got REALLY busy and I gave it up, but I WILL get back to it someday! No, really :) What did my teacher have me spend most of my time doing? Scales and warm-up exercises. I could ask "Will I ever use this in the real world?" What concert pianist or even bar room pianist would ever play a scale or a warm-up exercise for an audience? My reaction could be to say that I shouldn't have to spend time practicing scales, etc. because they are not "practical." But fortunately I did not do that, and I saw that practicing those boring pieces with "no real-world application" actually made me more agile at playing in general, so that I could play the (quite simple) "real" pieces better than if I had never practiced the scales. This course is a prerequisite for many other CSSE courses. Mostly, it its the CSSE equivalent of scales and warm-ups. Hopefully by the time you are done this course, you will be able to run up and down the scales at a rapid speed, so that later when you do applications (in Operating Systems and other courses, and presumably after graduation), you will be able to concentrate on the problems you are solving, because grinding out the code will now be easy. By the end of this course I want you to be able to use data structures almost as effortlessly as a concert pianist plays a D major scale . If that process allows you to do some interesting programs with immediate applications along the way, that's a bonus. But it's not the only point of the programming part of this course.