Back in university I studied language technology among others, and during this course we were taught an interesting method for generating random words based on the statistical characteristics of a given language. These words looked similar to the real words of the language at first sight, but they did not necessarily belong to the language.

This method consists of two phases. In the first phase we make a statistical representation of the words of the language. For this we need a sample set of words. We have to iterate through every word in the set, and calculate the probability that a letter in a word follows another letter. For example if we have only the word 'abacac' in the set, then the probabilities are:

- a word starts with letter 'a' with probability 1
- letter 'a' is followed by 'b' with prob. 1/3 and by 'c' with prob. 2/3
- letter 'b' is followed by 'a' with probability 1
- letter 'c' is followed by 'a' with probability 1/2, and ends the world with probability 1/2

In the second phase we can generate words based on the probabilities. First we choose a starting letter, and then we repeatedly choose a letter that can follow the previous letter until we choose to end the word (if possible). Staying with the previous example:

- we have to chose 'a' as the first letter
- the second letter can be 'b' with prob. 1/3 and by 'c' with prob. 2/3; if we throw a dice we choose 'b' if the number is 1 or 2, and 'c' otherwise; let's say we got 4, so our word so far is 'ac'
- 'c' can be followed by 'a' or end the word; 1 to 3 on the dice means we choose 'a', 4 to 6 means the word ends; let's say the number is 2, thus our word becomes 'aca'
- we have to choose a letter that follows 'a' again, let's say we threw 6, so our word is 'acac' now
- now a successor to 'c' is needed again, suppose our number is 5, so we are finished, our final word is 'acac'

This example does not show the power of this method, it's just to demonstrate the algorithm. With a large set of words much better results can be achieved.

The method can be simply used to generate names, this can be really useful for games. For example we know lots of elven names, and with this algorithm we can generate similar random names that we can give to NPCs or help the player choose a name. It can also be used for names of cities, animal races, etc.

Since sometimes the names can be too short or very long based on the samples, my implementation generates a new name until a name is found which is neither too short nor too long.

An improvement of this method might choose a letter based not on the last generated character, but on the last 2, 3, ...N characters. This way the words are more similar to the original words, but more samples are needed to achieve good results.

I have implemented this method in Javascript, so that it can be embedded into the blog. The source code is available for download.

You can also try it: enter the sample names into the textbox separated by spaces, and click 'Generate new name' to (surprisingly) generate a new name.

Generated name:

## No comments:

## Post a Comment