I’ve been going through the 7 Database and 7 Languages in 7 Weeks books (I know, I’m way behind – I’ve been busy dangit!). In there, for PostgreSQL, I learned about trigrams and how they use those to do “fuzzy searches”, or inexact searches.
Now, I have a couple of “Save” directories on a few hard drives full of old backups of backups. So, there are a ton of old files, many with slightly different names. So, how could you search through all of those and find “similar” file names – not necessarily exact matches? Sure, you could search with “*expense*.*” but that can be time-consuming. I have a computer, can’t I put it to work to figure out these tedious things?
Well, one of the cool things about being a programmer is that I can write a program to solve my own problem! In this case, it’s trigrams to the rescue!
What are Trigrams:
Trigrams in computing are the result of when you take a string, and take 3 characters from it, move over a position, take the next 3 characters, etc. You do this until you reach the end of the string. In the end, you have a collection of 3-character trigrams, for that input string. For example:
This happens to be a directory, but the string could be anything – the concept is the same. So, you’re saying “so what! what’s the use of that crazy list?”
How to use trigrams:
Well, here is how you might use these. Imagine you have one file, in this case – and you want to search all subdirectories for potential matching “candidate” file names. For this example, I created a bunch of files in the same folder for simplicity, but it doesn’t matter, they can be wherever.
the file I want to match, is highlighted: “The quick brown fox jumped over the lazy dog.txt”. Now, I want to see how many of the other files are similar and to what degree.
What I do is this: first, find all the trigrams for the filename that I have. Then, look at each other file name, and see how many of those trigrams are contained within the candidate file name. It might be 0 or maybe every single trigram matches (because it’s the same file name). The point is, some number of trigrams will “hit”.
In other words, I go file by file – and for each file name, I check to see how many of the trigrams in that file name. The more that match, the more similar that file name is.
So first, here is how I enumerate all the trigrams for a particular string:
Then I can go test the collisions with something like this:
if I output that to the screen, I see:
As you can see, the file with exactly the same name was a 100% match and all the others were some degree less. This is a pretty powerful concept, I can actually put a tangible number of just how “close” of a match one string is to the another!
How is this useful?
For me, this means that I can set a threshold, perhaps 85% or greater. I can run my program and say, for this file, show me any file anywhere on the file system that has a name that is at least 85% similar – and it’s just this simple to do!
I wasn’t aware of this technique before this book. I’m also pretty sure I wouldn’t have thought of it on my own either, so this was a pretty cool find and something that is actually making my life simpler.