Trigrams and doing fuzzy searches with C#

I’ve been going through the 7 Database and 7 Languages in 7 Weeks books (I know, I’m way behind – I’ve been busy dangit!). In there, for PostgreSQL, I learned about trigrams and how they use those to do “fuzzy searches”, or inexact searches.

The Problem:
Now, I have a couple of “Save” directories on a few hard drives full of old backups of backups. So, there are a ton of old files, many with slightly different names. So, how could you search through all of those and find “similar” file names – not necessarily exact matches? Sure, you could search with “*expense*.*” but that can be time-consuming. I have a computer, can’t I put it to work to figure out these tedious things?

Well, one of the cool things about being a programmer is that I can write a program to solve my own problem! In this case, it’s trigrams to the rescue!

What are Trigrams:
Trigrams in computing are the result of when you take a string, and take 3 characters from it, move over a position, take the next 3 characters, etc. You do this until you reach the end of the string. In the end, you have a collection of 3-character trigrams, for that input string. For example:

mx3A1D

This happens to be a directory, but the string could be anything – the concept is the same. So, you’re saying “so what! what’s the use of that crazy list?”

How to use trigrams:
Well, here is how you might use these. Imagine you have one file, in this case – and you want to search all subdirectories for potential matching “candidate” file names. For this example, I created a bunch of files in the same folder for simplicity, but it doesn’t matter, they can be wherever.

mx3CFE4

the file I want to match, is highlighted: “The quick brown fox jumped over the lazy dog.txt”. Now, I want to see how many of the other files are similar and to what degree.

What I do is this: first, find all the trigrams for the filename that I have. Then, look at each other file name, and see how many of those trigrams are contained within the candidate file name. It might be 0 or maybe every single trigram matches (because it’s the same file name). The point is, some number of trigrams will “hit”.

In other words, I go file by file – and for each file name, I check to see how many of the trigrams in that file name. The more that match, the more similar that file name is.

So first, here is how I enumerate all the trigrams for a particular string:

mx3517E

Then I can go test the collisions with something like this:

mx3211

if I output that to the screen, I see:

mx3EB20

As you can see, the file with exactly the same name was a 100% match and all the others were some degree less. This is a pretty powerful concept, I can actually put a tangible number of just how “close” of a match one string is to the another!

How is this useful?
For me, this means that I can set a threshold, perhaps 85% or greater. I can run my program and say, for this file, show me any file anywhere on the file system that has a name that is at least 85% similar – and it’s just this simple to do!

I wasn’t aware of this technique before this book. I’m also pretty sure I wouldn’t have thought of it on my own either, so this was a pretty cool find and something that is actually making my life simpler.

Posted in Machine Learning, Uncategorized
One comment on “Trigrams and doing fuzzy searches with C#
  1. […] Seder wrote a great post of trigrams last week.  He then asked me how the same functionality would be implemented in F# – […]

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Archives
Categories

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 9 other followers

%d bloggers like this: