There, I said it.
But it doesn’t rock at everything…
MySQL’s full-text searching is good, but oftentimes it leaves a lot to be desired. Recently, I was trying to get a query going that would offer suggestions for similar job titles, or at least the closest matching ones. In my quest for an elegant and simple solution, I took a trip around the virtual world:
- Russia (http://en.wikipedia.org/wiki/Vladimir_Levenshtein and his lovely distance algorithm)
- Finland (http://en.wikipedia.org/wiki/Michael_Widenius and MySQL full-text search)
- Egypt (http://sphinxsearch.com/)
…and back again, but nothing seemed to fit really well.
And then, I found the Jaro-Winkler distance algorithm and a wonderful implementation of it as a MySQL custom function. The Jaro-Winkler distance is – I feel – superior to the Levenshtein distance for short texts and names.
Using Jaro-Winkler distance with MySQL
For the purposes of this blog-post, I googled for “database of names download” and found this database of 16927 Cat Names. Here is the data sitting nicely in a MySQL table named CatName and the Jaro-Winkler function as a SQL dump:
Import that script into MySQL if you want to follow along.
So, now that we have all our ducks in a row – we can start playing around with the Jaro-Winkler function.
As you can see, the jaro_winkler_similarity function is returning a value between 0 and 1 for the similarity between two texts. Now, let’s search through our database of cat names to find a cat name like “Mr Whiskers”:
The results are wonderful, but note the time it took… OVER 20 SECONDS on a quad-core Macbook Pro.
Well, that’s not great – is it Mr Whiskers?
The reason why it’s so slow is because we’re doing a Jaro-Winkler analysis on every record in our database. That’s not terribly efficient. We only want to quantify how similar two similar words are. Consider the next example:
As you can see, we’re getting roughly the same results back, but in 0.28 seconds! How did we do it? We’re using MySQL’s amazing “LIKE” function to find similar words in the table, and then we only run the Jaro-Winkler distance algorithm against those returned records.