Remove Duplicate Words (Updated)
As a follow-up to the posting about removing consecutive repeated words in a text string, someone asked about removing all duplicated words in a text string, wherever they appear. So, if you have a string "one one two two three" then it should be shortened to "one two three". Well, we took on the challenge.

First off, keep in mind that the other regular expression removes consecutive repeated words. So, what you need to do is make sure that any repeated words are going to be next to each other. To do this, take the field value, then split it based on spaces into an array, sort the array in alphabetical order, then join it back again with a space separator.

function removeDuplicates(field) {
    var temp = field.value;
    var array = temp.split(" ");
    array.sort();  // put repeated words next to each other
    temp = array.join(" ");
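
To see what the sorting step does on its own, here is a small standalone sketch. The helper name sortWords is made up for illustration and is not part of the function above; it also assumes the words are separated by single spaces.

```javascript
// Hypothetical helper (not in the original function): sort the words of a
// string so that any repeated words end up next to each other.
function sortWords(text) {
    var words = text.split(" ");
    words.sort();              // default sort compares strings by character code
    return words.join(" ");
}

var sorted = sortWords("mama to mama a tata to tata");
// sorted is now "a mama mama tata tata to to"
```

Note that the default sort is case-sensitive, so "Mama" and "mama" would not land next to each other.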

Next, use the regular expression to remove consecutive repeated words. The regular expression handles only one duplicate at a time, so we use a loop that runs until a pass makes no changes. We test the "no changes" criterion with two variables, a "before" and an "after". If the two have the same length, nothing was removed from the string and we exit the loop.

    var newTemp;
    do {
        newTemp = temp;  // the "before" value
        temp = newTemp.replace(/\s(\w+\s)\1/, " $1");  // the "after" value
    } while (temp.length != newTemp.length);
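
Here is the same loop as a standalone sketch that can be run on a plain string. The function name collapseMiddleRepeats is an assumption for illustration only.

```javascript
// Hypothetical standalone version of the loop: keep replacing one repeated
// word at a time until a pass leaves the string length unchanged.
function collapseMiddleRepeats(text) {
    var before;
    var after = text;
    do {
        before = after;
        after = before.replace(/\s(\w+\s)\1/, " $1");
    } while (after.length != before.length);
    return after;
}

var collapsed = collapseMiddleRepeats("a mama mama tata tata to to");
// collapsed is "a mama tata to to" (the repeat at the very end is still there)
```

The leftover "to to" at the end is not matched because the last word has no space after it.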

While doing some testing, I noticed that the regular expression would only replace items where there were spaces both before and after the word. This means that if the first word in the string was immediately repeated, it wouldn't be found. The same for the last word in the string. So two additional regular expressions are needed to take care of that. (It's probably possible to do all this in one statement, but this is easier to follow and doesn't take up too much in terms of resources).

    temp = temp.replace(/^(\w+\s)\1/, "$1");
    temp = temp.replace(/(\s\w+)\1$/, "$1");

The first line removes the second word in the string if it is a duplicate of the first word. (Technically, it replaces the grouping of the first and second word with just the first word.) The second line removes the last word in the string if it is a duplicate of the second-to-last word.
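
As a quick illustration of those two statements, here is a sketch that applies them to a plain string. The function name trimEdgeRepeats is hypothetical.

```javascript
// Hypothetical helper: remove one repeated word at the start of the string
// and one repeated word at the end of the string.
function trimEdgeRepeats(text) {
    var result = text.replace(/^(\w+\s)\1/, "$1");   // "a a b" becomes "a b"
    result = result.replace(/(\s\w+)\1$/, "$1");     // "a b b" becomes "a b"
    return result;
}

var atEnd = trimEdgeRepeats("a mama tata to to");   // "a mama tata to"
var atStart = trimEdgeRepeats("a a b");             // "a b"
```

Each statement runs only once, so each removes at most one extra copy at its end of the string.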

Once all the duplicates have been removed, we put the value back into the field. However, our temporary variable has all the words in alphabetical order because of the sorting we did. If we want to keep the original word order, then we need to do some additional work. The best way to do this is to split the original string into an array, then go through the array elements one at a time. If the array element is part of the shortened string ("temp" in the code above) then it's added to the value that will ultimately go into the field. To ensure duplicates are removed, the shortened string is shortened even further every time an array element is processed (that array element is removed from the string), so a second occurrence of the same word no longer finds a match.

    var orig = field.value.split(" ");
    var finalStr = "";
    for (var i=0; i<orig.length; i++) {
        if (temp.indexOf(" " + orig[i] + " ") != -1) {
            finalStr += orig[i] + " ";
            temp = temp.split(" " + orig[i] + " ").join(" ");
        } else if ((temp.indexOf(orig[i]) != -1) && (temp.indexOf(" " + orig[i]) == (temp.length-orig[i].length-1))) {
            finalStr += orig[i] + " ";
            temp = temp.substring(0, (temp.length-orig[i].length-1));
        } else if (temp.indexOf(orig[i] + " ") == 0) {
            finalStr += orig[i] + " ";
            temp = temp.substring(orig[i].length+1, temp.length);
        } else if (temp == orig[i]) {
            finalStr += orig[i];
            temp = "";
        }
    }

There are four checks - the first one checks for the string somewhere in the middle of the temporary value. The second checks for the string at the end of the temporary value. The third checks for the string at the start of the temporary value. The final check is for an exact match (all that's left of the temporary string is the array value we're looking for).

After this block of code, the variable "finalStr" has one instance of each unique word, in the order they originally appeared. Notice, however, that a space is always placed after each word is added to the string. So there's going to be an extra space at the end of the string that needs to be removed. That is done next.

    if (finalStr.substring(finalStr.length-1, finalStr.length) == " ") {
        finalStr = finalStr.substring(0, finalStr.length-1);
    }
    field.value = finalStr;
}

The very last statement puts the shortened string back into the field as its new value. So, this function takes in a reference to a field and removes all duplicate words from the field value.
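
For testing outside of a form, the steps can be put together into one function that takes a string and returns a string. This is only a sketch under a few assumptions: the name removeDuplicateWords is made up, words are separated by single spaces, and the kept words are collected in an array and joined at the end instead of trimming a trailing space.

```javascript
// Hypothetical string-based version of the steps described above.
function removeDuplicateWords(text) {
    // Step 1: sort the words so duplicates sit next to each other.
    var words = text.split(" ");
    words.sort();
    var temp = words.join(" ");

    // Step 2: collapse repeats in the middle until a pass changes nothing.
    var previous;
    do {
        previous = temp;
        temp = previous.replace(/\s(\w+\s)\1/, " $1");
    } while (temp.length != previous.length);

    // Step 3: remove a repeat at the very start and at the very end.
    temp = temp.replace(/^(\w+\s)\1/, "$1");
    temp = temp.replace(/(\s\w+)\1$/, "$1");

    // Step 4: walk the original order and keep a word only while it is
    // still present in temp; remove it from temp once it has been kept.
    var orig = text.split(" ");
    var kept = [];
    for (var i = 0; i < orig.length; i++) {
        var word = orig[i];
        var tailStart = temp.length - word.length - 1;
        if (temp.indexOf(" " + word + " ") != -1) {
            temp = temp.split(" " + word + " ").join(" ");   // middle of temp
        } else if (tailStart >= 0 && temp.lastIndexOf(" " + word) == tailStart) {
            temp = temp.substring(0, tailStart);              // end of temp
        } else if (temp.indexOf(word + " ") == 0) {
            temp = temp.substring(word.length + 1);           // start of temp
        } else if (temp == word) {
            temp = "";                                        // exact match
        } else {
            continue;   // word was already kept earlier, skip this copy
        }
        kept.push(word);
    }
    return kept.join(" ");
}

var result = removeDuplicateWords("mama to mama a tata to tata");
// result is "mama to a tata"
```

One limitation carried over from the steps as written: a string made up of nothing but one word repeated three or more times (such as "a a a") is only reduced to two copies, because the start and end replacements each run once.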

Try it out to see how it works. Note that the example that showed the bug in the earlier version was "mama to mama a tata to tata". Give that one a try and it will result in "mama to a tata", as it should.