Patrick's Blog: August 2011

Wednesday, August 31, 2011

Bitwise XOR on Byte Array

Spent the past few hours scrambling for a generic XOR solution on two byte arrays. I needed a solution that would add the match value to a specific result value if it did not already exist (assuming the most significant byte starts at the beginning of the array). You'd think this would be somewhere in the C# libraries by now. After an hour of searching, I decided to write my own. Ended up with the solution below.

static byte[] BitwiseXOR(byte[] result, byte[] matchValue)
{
    if (result.Length == 0)
    {
        return matchValue;
    }

    byte[] newResult = new byte[matchValue.Length > result.Length ? matchValue.Length : result.Length];
            
    for (int i = 1; i < newResult.Length+1; i++)
    {
        //Use XOR on the LSBs until we run out
        if(i > result.Length)
        {
            newResult[newResult.Length - i] = matchValue[matchValue.Length - i];
        }
        else if (i > matchValue.Length)
        {
            newResult[newResult.Length - i] = result[result.Length - i];
        }
        else
        {
            newResult[newResult.Length -i] = 
                (byte)(matchValue[matchValue.Length - i] ^ result[result.Length - i]);
        }
    }
    return newResult;
}

Thursday, August 25, 2011

Windows Azure Marketplace

In an attempt to update my blog, I am posting this from my January discussion at the KY .NET user group meeting on consuming cloud services. I will be speaking this October on Sql Azure Federations, and will be focusing my efforts in the coming year on cloud based applications. More to come…

Windows Azure Marketplace: Consuming the Cloud DataMarket
Leveraging the Cloud’s data services can provide real-time feeds that keep your users bustling. Learn how to quickly absorb the DataMarket’s datasets using OData APIs and expose them to client applications on any platform. A sample from Microsoft’s Windows Azure Support team utilizing a Windows Azure service to consume crime data from Data.gov will be discussed. OData, one of the cloud’s increasingly popular and flexible web protocols for querying and updating data, will be examined and reviewed.

Link to slides:

PPTX: Windows_Azure_Marketplace_Consuming_the_Cloud

Tuesday, August 23, 2011

Fuzzy matching: a programmer's view of the Damerau-Levenshtein algorithm

The sheer mechanics of fuzzy matching algorithms are fascinating. Optimized variations are behind search engines and optical character recognition (OCR) softwares we see on the market today. An interesting algorithm I've recently used for a patient matching solution utilizes the Damerau-Levenshtein distance algorithm to locate fuzzy matches.

The Levenshetin approach considers how many substitutions, deletions, and insertions are needed to make two strings equal (also known as the edit distance between the strings). The algorithm was later improved by Damerau to include transposition (step highlghted on line 55). This solution is an example of bottom-up dynamic programming, because a matrix is flood filled to find the minimum distance. The complexity is cubic, but is improved to K x L complexity when using a maximum distance of interest (where K is the maximum distance, and L is the shortest string). The algorithm below uses a maximum distance (threshold paramter) and other simple optimizations to vastly improve the performance of the Damerau-Levenshtein. It has been tested to run over 10K string compares/second with two given strings under ten characters in length and a maximum threshold of three edits.

public static int DamerauLevenshteinDistanceImproved(string string1, 
    string string2, int threshold)
{
    // Return trivial case - where they are equal
    if (string1.Equals(string2))
        return 0;

    // Return trivial case - where one is empty
    if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
        return (string1 ?? "").Length + (string2 ?? "").Length;

    // Ensure string2 (inner cycle) is longer
    if (string1.Length > string2.Length)
    {
        var tmp = string1;
        string1 = string2;
        string2 = tmp;
    }

    // Return trivial case - where string1 is contained within string2
    if (string2.Contains(string1))
        return string2.Length - string1.Length;

    var length1 = string1.Length;
    var length2 = string2.Length;

    var d = new int[length1 + 1, length2 + 1];

    for (var i = 0; i <= d.GetUpperBound(0); i++)
        d[i, 0] = i;

    for (var i = 0; i <= d.GetUpperBound(1); i++)
        d[0, i] = i;

    for (var i = 1; i <= d.GetUpperBound(0); i++)
    {
        var im1 = i - 1;
        var im2 = i - 2;
        var minDistance = threshold;

        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            var jm1 = j - 1;
            var jm2 = j - 2;
            var cost = string1[im1] == string2[jm1] ? 0 : 1;

            var del = d[im1, j] + 1;
            var ins = d[i, jm1] + 1;
            var sub = d[im1, jm1] + cost;

            //Math.Min is slower than native code here
            //d[i, j] = Math.Min(del, Math.Min(ins, sub));
            d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

            if (i > 1 && j > 1 && string1[im1] == string2[jm2] 
                         && string1[im2] == string2[jm1])
                d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

            if (d[i, j] < minDistance)
                minDistance = d[i, j];
        }

        if (minDistance > threshold)
            return int.MaxValue;
    }

    return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
        ? int.MaxValue
        : d[d.GetUpperBound(0), d.GetUpperBound(1)];
}