Pattern Matching Algorithms

Hi there!

I’ve implemented some pattern matching algorithms in C#. They were part of a course I took at my university, so they are hardly optimized for real-life usage, but they do capture the conceptual idea behind each algorithm.

The algorithms I implemented are Knuth-Morris-Pratt (KMP), Quick Search and the brute force method.

Brute force method

The brute force method is quite simple. We align our pattern with the text, and every time we have a mismatch we shift the pattern one step to the right. This performs very badly: in the worst case we compare every letter of the text with every letter of the pattern, which is O(np*nt), where nt is the length of the text and np the length of the pattern. Imagine this for a text of 1,000,000 characters and a pattern of 100,000 characters: that is on the order of 10^11 character comparisons. This method is good enough for very small texts and patterns but, as you’ll see, the other algorithms are hardly more complicated to implement. The code is quite straightforward and given below.

Code:


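        public static int BruteForce(string text, string pattern)
        {
            var nt = text.Length;
            var np = pattern.Length;

            var it = 0; //Start of the current alignment in the text
            var ip = 0; //Number of pattern characters matched so far

            //Keep going while the pattern still fits in the remaining text.
            while (it <= (nt - np))
            {
                if (ip > (np - 1)) //Every pattern character matched, found!
                {
                    return it;
                }
                if (text[it + ip] == pattern[ip])
                {
                    ip++;
                }
                else
                {
                    //Mismatch: shift the pattern one step to the right and start over.
                    it++;
                    ip = 0;
                }
            }
            return -1;
        }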
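
To put a number on that worst case, here is a small sketch that counts the character comparisons a brute force search makes. The names WorstCaseDemo and CountComparisons are made up for this illustration and are not part of the course code, and the input used below (a text of only 'a's and a pattern that ends in a 'b') is just an assumed bad case.

```csharp
public static class WorstCaseDemo
{
    // Hypothetical helper: counts the character comparisons made by a brute
    // force search until the first match is found or the text runs out.
    public static long CountComparisons(string text, string pattern)
    {
        long comparisons = 0;
        for (var it = 0; it + pattern.Length <= text.Length; it++)
        {
            var ip = 0;
            while (ip < pattern.Length)
            {
                comparisons++;
                if (text[it + ip] != pattern[ip])
                    break;      // Mismatch: give up on this alignment.
                ip++;
            }
            if (ip == pattern.Length)
                break;          // Full match: stop counting.
        }
        return comparisons;
    }
}
```

For example, WorstCaseDemo.CountComparisons(new string('a', 10), "aaab") tries 7 alignments and needs 4 comparisons for each of them, 28 in total, which is exactly (nt - np + 1) * np.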

Knuth-Morris-Pratt

This algorithm is rather difficult to explain in a simple blog post, so bear with me as I try. The algorithm depends entirely on the sigma function. For the part of the pattern that we have already matched, this function returns the length of the longest prefix of the pattern that is also a proper suffix of that matched part. This is quite crucial to understand. Take a look at the image below.

[Image: KMP]

As you can see, we have a mismatch between the characters ‘x’ and ‘a’. Using the brute force method we would just shift our pattern one step to the right, so the first ‘a’ would align with the second ‘b’ of our text. What KMP does instead is consult the sigma function. You can see in the image that a suffix of the matched part, up to the last ‘a’ in our pattern, is ‘ab’, for example. ‘aab’ is one too, but you’ll get it in a minute. This ‘ab’ is also a prefix of our pattern. This means that if we shift the pattern so that this prefix lines up with that suffix, we don’t have to match those characters again, because we know for certain that all the characters before our mismatch matched. In other words, we shift by the number of characters already matched minus the length of that prefix. So we shift the pattern as you can see in the image. That is the idea behind the sigma function. Once you get it, it’s a very simple algorithm.

The code:

        private static int[] SigmaTable(string pattern)
        {
            //sigmatable[ip] is the length of the longest prefix of the pattern that is also
            //a proper suffix of the first ip characters (the part matched before a mismatch at ip).

            var np = pattern.Length;
            var sigmatable = new int[np];
            if (np == 0)
                return sigmatable; //Empty pattern, nothing to calculate.

            var ip = 2;
            var k = 0;

            while (true)
            {
                if (ip >= np)  //We have shifted out of our pattern, we are done.
                    break;
                if (pattern[k] == pattern[ip - 1])//Our prefix extends
                {
                    sigmatable[ip] = (k + 1);
                    ip++;
                    k++;
                    continue;
                }
                if (k > 0)//We have a mismatch, so we need to see if we have a smaller prefix that we can use
                          //We already calculated this for the prefix of length k, so just get it.
                {
                    k = sigmatable[k];
                    continue;
                }
                //We have a mismatch, but k = 0, so we dont have a prefix.
                sigmatable[ip] = 0;
                ip++;
            }
            sigmatable[0] = -1; //Mismatch on the very first character: Kmp shifts one step.
            return sigmatable;
        }
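
To make the sigma function concrete, here is a naive sketch that computes the values straight from the definition by trying every candidate length, longest first. It is far slower than a real table computation and only meant for checking small examples by hand. The names SigmaDemo and NaiveSigma are invented for this illustration, and the -1 at index 0 is assumed to follow the convention the Kmp method in this post relies on.

```csharp
using System;

public static class SigmaDemo
{
    // Hypothetical helper: sigma[i] is the length of the longest prefix of the
    // pattern that is also a proper suffix of the first i characters.
    // sigma[0] is -1 by convention (mismatch on the very first character).
    public static int[] NaiveSigma(string pattern)
    {
        var sigma = new int[pattern.Length];
        for (var i = 0; i < pattern.Length; i++)
        {
            sigma[i] = i == 0 ? -1 : 0;
            // Try the longest candidate first; a proper suffix is shorter than i.
            for (var len = i - 1; len > 0; len--)
            {
                // Compare pattern[0..len) with pattern[i-len..i).
                if (string.CompareOrdinal(pattern, 0, pattern, i - len, len) == 0)
                {
                    sigma[i] = len;
                    break;
                }
            }
        }
        return sigma;
    }
}
```

For the pattern "aabaaab" this gives -1, 0, 1, 0, 1, 2, 2. Take index 5 as an example: the first five characters are "aabaa", and "aa" is the longest prefix of the pattern that is also a proper suffix of them, so the value is 2.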

The actual algorithm

        public static int Kmp(string text, string pattern)
        {
            var nt = text.Length;
            var np = pattern.Length;
            var sigmatable = SigmaTable(pattern);

            var it = 0;
            var ip = 0;

            //We will keep looping, until
            //--> We have found a match (ip > (np - 1))
            //--> We have shifted out of our text, not found (it > (nt - np))
            //If we have a match, step forward in the pattern!
            //If we have a mismatch, shift the pattern and jump back in it as much as the sigmatable tells us.

            while (true)
            {
                if (ip > (np - 1))
                {
                    return it;
                }
                if (it > (nt - np))
                {
                    return -1;
                }
                if (text[it + ip] == pattern[ip])
                {
                    ip++;
                    continue;
                }
                it = (it + (ip - sigmatable[ip]));
                ip = ip > 0 ? sigmatable[ip] : 0;
            }
        }

Quick Search

This is my personal favorite for two reasons: 1) it’s fast and 2) it’s anything but difficult!

This algorithm does some preprocessing too, although it is not as difficult to explain as KMP. For every character in the pattern, we store how far the pattern must shift to bring the rightmost occurrence of that character under the text character just past the current alignment: the pattern length minus the (zero-based) index of that rightmost occurrence. E.g. “abcabcd” will have 4 for a, 3 for b, 2 for c and 1 for d. Characters that do not occur in the pattern get the pattern length plus one. Now, when we have a mismatch, we take the first character that comes after the entire pattern in the text. So in our picture above, we would take ‘y’. We then check whether this character exists in our pattern. If it does not, we don’t even have to try matching: we skip it entirely, i.e. shift the beginning of our pattern past that character (see the performance?). If it does exist, we shift our pattern to the right by the value in our table and start matching from scratch.
We don’t know whether our first characters will match, but we do know that at least that one character will. The performance boost this algorithm gets comes from the parts where we can skip characters that don’t exist in our pattern.

The code:

        private static ShiftTable ComputeShiftTable(string pattern)
        {
            //For every character in the pattern we store the shift for its rightmost occurrence:
            //pattern.Length minus the index of that occurrence. Assumes a non-empty pattern.

            int[] shift_table;

            int minAscii = pattern[0];
            int maxAscii = minAscii;

            var idx = 0;
            //First we calculate the lowest and highest character value to size our array
            while (idx < pattern.Length)
            {
                minAscii = Math.Min(minAscii, pattern[idx]);
                maxAscii = Math.Max(maxAscii, pattern[idx]);
                idx++;
            }

            shift_table = new int[maxAscii - minAscii + 1];

            //Loop over the table and store patternlength + 1 in it. (If we dont find the character, we have to shift the entire pattern past it!).
            for (int i = 0; i < shift_table.Length; i++)
                shift_table[i] = pattern.Length + 1;

            //Loop over the pattern again. For every char we update its shift in the array.
            //Later occurrences overwrite earlier ones, so we always end up with the rightmost occurrence.
            idx = 0;
            while (idx < pattern.Length)
            {
                int currentascii = pattern[idx];
                shift_table[currentascii - minAscii] = pattern.Length - idx;
                idx++;
            }
            return new ShiftTable(shift_table, maxAscii, minAscii, pattern.Length);
        }
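
The ShiftTable class itself is not listed here. A minimal sketch of what such a class could look like, assuming it only wraps the array and the bounds passed to the constructor call above and exposes the Shift method used by the search; the field names and the way characters outside the stored range are handled are assumptions for this illustration.

```csharp
public sealed class ShiftTable
{
    // Assumed layout: shifts are indexed by (character value - minAscii).
    private readonly int[] _shifts;
    private readonly int _maxAscii;
    private readonly int _minAscii;
    private readonly int _patternLength;

    // Parameter order mirrors the call in ComputeShiftTable.
    public ShiftTable(int[] shifts, int maxAscii, int minAscii, int patternLength)
    {
        _shifts = shifts;
        _maxAscii = maxAscii;
        _minAscii = minAscii;
        _patternLength = patternLength;
    }

    // How far to move the pattern, given the text character just past it.
    public int Shift(char c)
    {
        // A character outside [min, max] cannot occur in the pattern,
        // so the whole pattern can jump past it.
        if (c < _minAscii || c > _maxAscii)
            return _patternLength + 1;
        return _shifts[c - _minAscii];
    }
}
```

With the table for "abcabcd" (the values 4, 3, 2, 1 for 'a' to 'd'), Shift('a') gives 4 and Shift('y') gives 8, the pattern length plus one.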

The algorithm

         
        public static int QuickSearch(string text, string pattern)
        {
            var nt = text.Length;
            var np = pattern.Length;
            if (np == 0)
            {
                return 0; //An empty pattern matches at the start; the shift table needs at least one character.
            }
            var shift = ComputeShiftTable(pattern);
            var it = 0;
            var ip = 0;
            while (true)
            {
                if (ip > (np - 1))//We shifted out of our pattern, found!
                {
                    return it;
                }
                if (it > (nt - np))//We reached the end of our text with our pattern, not found!
                {
                    return -1;
                }
                if (text[it + ip] == pattern[ip])//We have a match, continue.
                {
                    ip++;
                    continue;
                }
                //We didnt have a match. If there is no character after our pattern, we are at the end of the text, not found!
                if (it + np >= nt)
                {
                    return -1;
                }
                //Get the first character next to our pattern, and look up the shift.
                var ct = text[it + np];

                it = (it + shift.Shift(ct));
                ip = 0;
            }
        }

That’s it!

Christophe,
