In the previous post about random name generation (which can be found here) I have mentioned an improvement of the described method, and now I have it implemented.
I have modified the original implementation, so that it uses the last two letters for choosing the new letter. I also had to modify the first phase of the algorithm which makes the statistical representation of the "language", now it calculates the probabilities that a letter in a word follows two previous letters.
The names generated with this modified method are more similar to the original sample names, and it happens more often, that it is one of the samples - though it can be helped by using more samples. The length of the generated names is also more close to the average length of the samples, so I have removed the part that generates a new name until the length is neither too short nor too long.
The source code is available as before, and you can try it with the previous sample names for comparison:
Procedural content generation is usually used for making maps, terrains, plants, etc., but names can also be generated procedurally.
Back in university I studied language technology among others, and during this course we were taught an interesting method for generating random words based on the statistical characteristics of a given language. These words looked similar to the real words of the language at first sight, but they did not necessarily belong to the language.
This method consists of two phases. In the first phase we make a statistical representation of the words of the language. For this we need a sample set of words. We have to iterate through every word in the set, and calculate the probability that a letter in a word follows another letter. For example if we have only the word 'abacac' in the set, then the probabilities are:
a word starts with letter 'a' with probability 1
letter 'a' is followed by 'b' with prob. 1/3 and by 'c' with prob. 2/3
letter 'b' is followed by 'a' with probability 1
letter 'c' is followed by 'a' with probability 1/2, and ends the world with probability 1/2
In the second phase we can generate words based on the probabilities. First we choose a starting letter, and then we repeatedly choose a letter that can follow the previous letter until we choose to end the word (if possible). Staying with the previous example:
we have to chose 'a' as the first letter
the second letter can be 'b' with prob. 1/3 and by 'c' with prob. 2/3; if we throw a dice we choose 'b' if the number is 1 or 2, and 'c' otherwise; let's say we got 4, so our word so far is 'ac'
'c' can be followed by 'a' or end the word; 1 to 3 on the dice means we choose 'a', 4 to 6 means the word ends; let's say the number is 2, thus our word becomes 'aca'
we have to choose a letter that follows 'a' again, let's say we threw 6, so our word is 'acac' now
now a successor to 'c' is needed again, suppose our number is 5, so we are finished, our final word is 'acac'
This example does not show the power of this method, it's just to demonstrate the algorithm. With a large set of words much better results can be achieved.
The method can be simply used to generate names, this can be really useful for games. For example we know lots of elven names, and with this algorithm we can generate similar random names that we can give to NPCs or help the player choose a name. It can also be used for names of cities, animal races, etc.
Since sometimes the names can be too short or very long based on the samples, my implementation generates a new name until a name is found which is neither too short nor too long.
An improvement of this method might choose a letter based not on the last generated character, but on the last 2, 3, ...N characters. This way the words are more similar to the original words, but more samples are needed to achieve good results.
I have implemented this method in Javascript, so that it can be embedded into the blog. The source code is available for download.
You can also try it: enter the sample names into the textbox separated by spaces, and click 'Generate new name' to (surprisingly) generate a new name.