Trigrams and doing fuzzy searches with C#

I’ve been going through the 7 Database and 7 Languages in 7 Weeks books (I know, I’m way behind – I’ve been busy dangit!). In there, for PostgreSQL, I learned about trigrams and how they use those to do “fuzzy searches”, or inexact searches.

The Problem:
Now, I have a couple of “Save” directories on a few hard drives full of old backups of backups. So, there are a ton of old files, many with slightly different names. So, how could you search through all of those and find “similar” file names – not necessarily exact matches? Sure, you could search with “*expense*.*” but that can be time-consuming. I have a computer, can’t I put it to work to figure out these tedious things?

Well, one of the cool things about being a programmer is that I can write a program to solve my own problem! In this case, it’s trigrams to the rescue!

What are Trigrams:
Trigrams in computing are the result of when you take a string, and take 3 characters from it, move over a position, take the next 3 characters, etc. You do this until you reach the end of the string. In the end, you have a collection of 3-character trigrams, for that input string. For example:

mx3A1D

This happens to be a directory, but the string could be anything – the concept is the same. So, you’re saying “so what! what’s the use of that crazy list?”

How to use trigrams:
Well, here is how you might use these. Imagine you have one file, in this case – and you want to search all subdirectories for potential matching “candidate” file names. For this example, I created a bunch of files in the same folder for simplicity, but it doesn’t matter, they can be wherever.

mx3CFE4

the file I want to match, is highlighted: “The quick brown fox jumped over the lazy dog.txt”. Now, I want to see how many of the other files are similar and to what degree.

What I do is this: first, find all the trigrams for the filename that I have. Then, look at each other file name, and see how many of those trigrams are contained within the candidate file name. It might be 0 or maybe every single trigram matches (because it’s the same file name). The point is, some number of trigrams will “hit”.

In other words, I go file by file – and for each file name, I check to see how many of the trigrams in that file name. The more that match, the more similar that file name is.

So first, here is how I enumerate all the trigrams for a particular string:

mx3517E

Then I can go test the collisions with something like this:

mx3211

if I output that to the screen, I see:

mx3EB20

As you can see, the file with exactly the same name was a 100% match and all the others were some degree less. This is a pretty powerful concept, I can actually put a tangible number of just how “close” of a match one string is to the another!

How is this useful?
For me, this means that I can set a threshold, perhaps 85% or greater. I can run my program and say, for this file, show me any file anywhere on the file system that has a name that is at least 85% similar – and it’s just this simple to do!

I wasn’t aware of this technique before this book. I’m also pretty sure I wouldn’t have thought of it on my own either, so this was a pretty cool find and something that is actually making my life simpler.

Posted in Machine Learning, Uncategorized
One comment on “Trigrams and doing fuzzy searches with C#
  1. […] Seder wrote a great post of trigrams last week.  He then asked me how the same functionality would be implemented in F# – […]

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Archives
Categories

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 5 other followers

%d bloggers like this: