How to Filter Cells that have Duplicate Text Strings (Words) in it

One of my friends works in a healthcare analytics company. He often connects with me on some of the real-life issues he faces while working with data in Excel.

A lot of times, I convert his queries into Excel tutorials on this site, as it could be helpful for my other readers as well.

This is also one such tutorial.

My friend called me last week with the following issue:

There is address data in a column in Excel, and I want to identify/filter cells the where the address has duplicate text strings (words) in it.

Here is the similar dataset in which he wanted to filter cells that have a duplicate text string in it (the ones with red arrows):

Identify Duplicate Text Strings in Excel - Dataset Address

Now what makes this difficult is that there is no consistency in this data. Since this is a compilation of data set that has been manually created by sales reps, there can be variations in the dataset.

Consider this:

  • Any text string could repeat in this dataset. For example, it could be the name of the area or the name of the city or both.
  • The words are separated by a space character, and there is no consistency in whether the city name would be there after six words or eight words.
  • There are thousands of records like this, and the need is to filter those records where there are any duplicate text strings.

After considering many options (such as text to columns and formulas), I finally decided to use VBA to get this done.

So I created a custom VBA function (‘IdDuplicate') to analyze these cells and give me TRUE if there is a duplicate word in the text string, and FALSE in case there are no repetitions (as shown below):

Identify Duplicate Text Strings in Excel - Dataset Address Demo

This custom function analyzes each word in the text string and checks how many times it occurs in the text. If the count is more than 1, it returns TRUE; else it returns FALSE.

Also, it has been created to only count words more than three characters.

Once I have the TRUE/FALSE data, I can easily filter all the records that are TRUE.

Now let me show you how to do this in Excel.

VBA Code for the Custom Function

This is done by creating a custom function in VBA. This function can then be used as any other worksheet function in Excel.

Here is the VBA code for it:

Function IdDuplicates(rng As Range) As String
Dim StringtoAnalyze As Variant
Dim i As Integer
Dim j As Integer
Const minWordLen As Integer = 4
StringtoAnalyze = Split(UCase(rng.Value), " ")
For i = UBound(StringtoAnalyze) To 0 Step -1
If Len(StringtoAnalyze(i)) < minWordLen Then GoTo SkipA
For j = 0 To i - 1
If StringtoAnalyze(j) = StringtoAnalyze(i) Then
IdDuplicates = "TRUE"
GoTo SkipB
End If
Next j
SkipA:
Next i
IdDuplicates = "FALSE"
SkipB:
End Function

Thanks Walter for suggesting a better approach to this code!

How to Use this VBA Code

Now that you have the VBA code, you need to place it in the backend of Excel, so that it can work as a regular worksheet function.

Below are the steps to put the VBA code on the backend:

  1. Go to the Developer tab.identify duplicate text strings - developer tab in ribbon
  2. Click on Visual Basic (you can also use the keyboard shortcut ALT + F11)Select Visual basic from the ribbon
  3. In the VB Editor back end that opens, right-click on any of the workbook objects.
  4. Go to ‘Insert' and click on ‘Module'. This will insert the module object for the workbook.Insert Module for the custom VAB code
  5. In the Module code window, copy and paste the VBA code mentioned above.VBA Code in the backend - to identify duplicate text strings

Once you have the VBA code in the back end, you can use the function – ‘IdDuplicates' as any other regular worksheet function.

This function takes one single argument, which is the cell reference of the cell where you have the text.

The result of the function is TRUE (if there are duplicate words in it) or FALSE (if there are no duplicates). Once you have this list of TRUE/FALSE, you can filter the ones with TRUE to get all the cells that have duplicate text strings in it.

Note: I have created the code only to consider those words that are more than three characters long. This ensures that if there are 1, 2, or 3 character-long words (such as 12 A, K G M, or L D A) in the text string, these are ignored while counting the duplicates. If you want, you can easily change this in the code.

This function will only be available in the workbook where you have copied the code in the module. In case you want this to be available in other workbooks as well, you need to copy and paste this code in those workbooks. Alternatively, you can also create an add-in (enabling which would make this function available in all the workbooks on your system).

Also, remember to save this workbook in .xlsm extension (since it has a macro code in it).

You May Also Like the Following Excel Tutorials: