Markov Chain Algorithm


Background

The idea of using this algorithm came from Kernighan and Pike's The Practice of Programming, chapter 3.

Purpose

Running the Markov algorithm with prefix length n=2 on the text of L. Frank Baum's The Wonderful Wizard of Oz might result in something like the following.

Dorothy lived in the doorway and looked upon her right foot and said: "Oz has sent for you." So the Wizard in the basket. "Good-bye!" "Good-bye!" shouted everyone, and all eyes were turned upward to where the Great Oz." "Oh, indeed!" exclaimed the man. "He has more brains than he needs." "And I am something terrible." "But, I don't care who kills her. But until she is melted," explained the Scarecrow. He had been cut. "There," said he; "now you have earned the brains you will oil the joints of my legs, I shall know everything."

As you can see, the generated text makes little sense, but its grammar, sentence structure, and dialogue resemble those of the original text. It approximates natural language in structure, if not in sense. The generated text also shares statistical properties with the original: with n=2, every three-word phrase in the output appears somewhere in the input.

The algorithm builds statistics about the frequency with which words follow certain prefix groups of other words. For example, with prefix length n=1, each generated word is randomly chosen based on the probability that it follows the previous word, as observed in the input text. With n=2, the next generated word is based on the previous two words, and so on. Small values of n tend to produce gibberish; large ones tend to reproduce the input exactly. One could imagine running the algorithm at n=0, which would yield a truly random order of words from the input text. Your implementation is not required to work for n=0.

This idea of "chains" puts a few serious constraints on the generated text. For example, the output will always begin with the first n words of the input. Similarly, if allowed to run to completion, the algorithm will always end on the same word as the input. The cause of this behavior will become clear later.

Definitions

If words w1, w2, ..., wn, wn+1 occur consecutively in the input, then wn+1 is said to be a suffix of the prefix w1, w2, ..., wn. Remember, n is called the prefix length.

If the same prefix-suffix pair exists x times in the input, then that suffix is said to have multiplicity x. Multiplicity of the suffix is important during text generation because suffixes are chosen according to their probability of following a given prefix as observed from the input. A suffix with multiplicity 2 has twice the probability of being chosen as a suffix with multiplicity 1.

Concrete Example

Examine this famous passage (often called "The Beatitudes") from the New International Version of the Bible. It is our input. Notice the recurrence of certain phrases, coupled with the uniqueness of others.

Blessed are the poor in spirit,
for theirs is the kingdom of heaven.
Blessed are those who mourn,
for they will be comforted.
Blessed are the meek,
for they will inherit the earth.
Blessed are those who hunger and thirst for righteousness,
for they will be filled.
Blessed are the merciful,
for they will be shown mercy.
Blessed are the pure in heart,
for they will see God.
Blessed are the peacemakers,
for they will be called sons of God.

Matthew 5:3-9

Let's build a table of prefixes and suffixes for n=2, considering only the prefixes that come from the first and last sentences (otherwise the table would span several pages in this document), but listing all corresponding suffixes from the passage that match those prefixes.

It is clear that any generated text must start with "Blessed are", since those are the first n words of the input. However, to build our table, every word must have a prefix, and every n-word phrase must have at least one suffix. The prefixes for the first "Blessed" and "are" spill off the front of the paragraph; similarly, the suffix of "of God." spills off the end. Rather than make our algorithm's flow control more complex to handle these boundary cases, we shall introduce artificial data.

Let NONWORD represent some word that is not allowed to occur in the input. Then consider any words that need to exist but are outside of the paragraph to be NONWORD. Then we can say that the first two prefixes from this passage are "NONWORD NONWORD" and "NONWORD Blessed". Given prefix length n, we will need n artificial prefixes, the first with n NONWORDs, the second with n-1, and so on. Similarly, we shall say the suffix of "of God." is "NONWORD". The end result is that to begin text generation, we use a prefix of n NONWORDs, and the algorithm terminates normally if NONWORD is ever chosen as the next output word.  In order to keep output from our tests from getting too long, we place an artificial limit on the number of generated words (this is the third parameter of the Markov constructor).

Now we can look at the table.

Prefix            Suffixes
NONWORD NONWORD   Blessed
NONWORD Blessed   are
Blessed are       the (multiplicity 5)
                  those (multiplicity 2)
are the           poor
                  meek,
                  merciful,
                  pure
                  peacemakers,
the poor          in
poor in           spirit,
in spirit,        for
spirit, for       theirs
for theirs        is
theirs is         the
is the            kingdom
the kingdom       of
kingdom of        heaven.
of heaven.        Blessed
the peacemakers,  for
peacemakers, for  they
for they          will
they will         be (multiplicity 4)
                  inherit
                  see
will be           comforted.
                  filled.
                  shown
                  called
be called         sons
called sons       of
sons of           God.
of God.           NONWORD

You can see that some prefixes lead to many different choices for the output text, while others have only one choice for the next word. For example, "the poor in spirit, for theirs is the kingdom of heaven." is a chain with no alternate branches. However, the prefix "Blessed are" has two possible next words, and "are the" has five. Also notice that "the" follows "Blessed are" more often than "those" does. You can now see why the first n words of the output are always the same as those of the input: every prefix containing a NONWORD occurs exactly once, so each of the first n choices is forced.
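To make the multiplicity weighting concrete: the prefix "Blessed are" is followed by seven suffix occurrences in all, "the" five times and "those" twice, so during generation "the" is chosen with probability 5/7 and "those" with probability 2/7.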

Algorithm Description

Now that we've defined prefixes, suffixes, and NONWORD, we can give pseudo-code for the Markov Chain algorithm. Note that there are no special cases in the flow control; the NONWORD data takes care of that.

// build data structures
w[1]...w[n] = NONWORD...NONWORD; // beginning of input text
while( there are more input words ) {
   w[n+1] = next word;
   add w[n+1] to the MultiSet of suffixes of w[1]...w[n];
   w[1]...w[n] = w[2]...w[n+1];
}
add NONWORD to the MultiSet of suffixes of w[1]...w[n]; // end of input text

// output new text
w[1]...w[n] = NONWORD...NONWORD;
w[n+1] = the one suffix associated with a prefix that is all NONWORDs;
while( w[n+1] != NONWORD ) {
   output w[n+1];
   replace the prefix w[1]...w[n] by the prefix w[2]...w[n+1];
   w[n+1] = a randomly chosen suffix of w[1]...w[n];
   /* note: The random choice should be weighted based on the
      multiplicity of each suffix. MultiSet's findKth() method takes care of this for you. */
}
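
To make the pseudo-code concrete, here is a minimal Java sketch of the whole algorithm. It is illustrative only, not the required design for this assignment: in place of the course's MultiSet, it stores each suffix once per occurrence in a plain ArrayList, so that picking a uniformly random element is automatically weighted by multiplicity (the job findKth() does for the real MultiSet). The class and method names here are hypothetical.

import java.util.*;

public class MarkovSketch {
    // "\n" can never appear in a word produced by splitting on whitespace,
    // so it is a safe choice of NONWORD.
    private static final String NONWORD = "\n";

    public static String generate(String input, int n, int maxWords) {
        // Build the prefix -> suffixes table. A suffix of multiplicity x
        // simply appears x times in its list.
        Map<List<String>, List<String>> table = new HashMap<>();
        List<String> prefix = new ArrayList<>(Collections.nCopies(n, NONWORD));
        for (String word : input.trim().split("\\s+")) {
            table.computeIfAbsent(new ArrayList<>(prefix), k -> new ArrayList<>()).add(word);
            prefix.remove(0);          // w[1]...w[n] = w[2]...w[n+1]
            prefix.add(word);
        }
        // End of input text: the last prefix's suffix is NONWORD.
        table.computeIfAbsent(new ArrayList<>(prefix), k -> new ArrayList<>()).add(NONWORD);

        // Generate new text, starting from the all-NONWORD prefix.
        StringBuilder out = new StringBuilder();
        Random rng = new Random();
        prefix = new ArrayList<>(Collections.nCopies(n, NONWORD));
        for (int count = 0; count < maxWords; count++) {
            List<String> suffixes = table.get(prefix);
            String next = suffixes.get(rng.nextInt(suffixes.size())); // weighted by multiplicity
            if (next.equals(NONWORD)) {
                break;                 // ran off the end of the input text
            }
            out.append(next).append(' ');
            prefix.remove(0);
            prefix.add(next);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String passage = "Blessed are the poor in spirit, for theirs is the kingdom of heaven. "
                       + "Blessed are those who mourn, for they will be comforted.";
        System.out.println(generate(passage, 2, 100));
    }
}

Note how the all-NONWORD starting prefix and the final NONWORD suffix remove every boundary special case from both loops, exactly as the pseudo-code promises.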

Implementation

The pseudo-code makes the procedural necessities of the algorithm fairly clear. It should also be clear that you will need to utilize several different data structures: something to hold an ordered prefix of n words, a table that associates each prefix with its collection of suffixes, and a MultiSet to record those suffixes along with their multiplicities.

Most of these design decisions have been made for you, as described in the overview document.