Tuesday, August 23, 2011

Fuzzy matching: a programmer's view of the Damerau-Levenshtein algorithm


The sheer mechanics of fuzzy matching algorithms are fascinating. Optimized variations are behind search engines and optical character recognition (OCR) softwares we see on the market today. An interesting algorithm I've recently used for a patient matching solution utilizes the Damerau-Levenshtein distance algorithm to locate fuzzy matches.

The Levenshetin approach considers how many substitutions, deletions, and insertions are needed to make two strings equal (also known as the edit distance between the strings). The algorithm was later improved by Damerau to include transposition (step highlghted on line 55). This solution is an example of bottom-up dynamic programming, because a matrix is flood filled to find the minimum distance. The complexity is cubic, but is improved to K x L complexity when using a maximum distance of interest (where K is the maximum distance, and L is the shortest string). The algorithm below uses a maximum distance (threshold paramter) and other simple optimizations to vastly improve the performance of the Damerau-Levenshtein. It has been tested to run over 10K string compares/second with two given strings under ten characters in length and a maximum threshold of three edits.

public static int DamerauLevenshteinDistanceImproved(string string1, 
    string string2, int threshold)
{
    // Return trivial case - where they are equal
    if (string1.Equals(string2))
        return 0;

    // Return trivial case - where one is empty
    if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
        return (string1 ?? "").Length + (string2 ?? "").Length;

    // Ensure string2 (inner cycle) is longer
    if (string1.Length > string2.Length)
    {
        var tmp = string1;
        string1 = string2;
        string2 = tmp;
    }

    // Return trivial case - where string1 is contained within string2
    if (string2.Contains(string1))
        return string2.Length - string1.Length;

    var length1 = string1.Length;
    var length2 = string2.Length;

    var d = new int[length1 + 1, length2 + 1];

    for (var i = 0; i <= d.GetUpperBound(0); i++)
        d[i, 0] = i;

    for (var i = 0; i <= d.GetUpperBound(1); i++)
        d[0, i] = i;

    for (var i = 1; i <= d.GetUpperBound(0); i++)
    {
        var im1 = i - 1;
        var im2 = i - 2;
        var minDistance = threshold;

        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            var jm1 = j - 1;
            var jm2 = j - 2;
            var cost = string1[im1] == string2[jm1] ? 0 : 1;

            var del = d[im1, j] + 1;
            var ins = d[i, jm1] + 1;
            var sub = d[im1, jm1] + cost;

            //Math.Min is slower than native code here
            //d[i, j] = Math.Min(del, Math.Min(ins, sub));
            d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

            if (i > 1 && j > 1 && string1[im1] == string2[jm2] 
                         && string1[im2] == string2[jm1])
                d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

            if (d[i, j] < minDistance)
                minDistance = d[i, j];
        }

        if (minDistance > threshold)
            return int.MaxValue;
    }

    return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
        ? int.MaxValue
        : d[d.GetUpperBound(0), d.GetUpperBound(1)];
}

2 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. Submit your website or blog now for appearing in Google and over 300 other search engines!

    Over 200,000 sites handled!

    Submit RIGHT NOW with I NEED HITS!!!

    ReplyDelete