Patterns of a Different Kind
By: Mike Morris
May. 28, 2003 12:00 AM
Much has been made of .NET's language-neutral features and how programmers can choose from a wide variety of languages ranging from C# and VB.NET to niche languages such as Python and Eiffel. But there is one dialect that all language variants speak - and that is the language of regular expressions. Regular expressions grew out of work done by Hartford, Connecticut-born mathematician Stephen Kleene on what he termed "the algebra of regular sets." His theoretical ideas later formed the basis for early text-manipulation tools on the Unix platform and were further popularized by tight integration of a regular expression engine into the Perl programming language.
Most programmers who have been coding for any length of time have had occasion to use regular expressions in some fashion. Their terse, somewhat cryptic syntax can be intimidating at first, but with a basic understanding of the most common pattern constructs, regular expressions can become an indispensable part of the .NET programmer's toolkit. The richly expressive nature of regular expressions allows programmers to work with patterns and matches within text in a powerful and flexible way. They are ideal for validating data entry, extracting and replacing substrings, and generating reports from text-based nonrelational data.
This article begins with a general overview of regular expressions. If you're already comfortable with them, feel free to skip over the first section and delve into the .NET implementation details!
Regular Expressions: An Overview
Wildcards
Pattern: .he..
Target: The purple lathe was in the tree
Notice how the expression matches on the uppercase T in The, as well as the space character between some of the words, because each period stands for any single character. If the intent were to actually match a period in the target, the period in question would need to be escaped by prefixing it with a backslash. For example, in .he\.. the escaped period matches only a literal period, while the unescaped periods remain wildcards. This escape mechanism applies to all metacharacters for which a literal match is the intent.
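To make the wildcard behavior concrete, here is a minimal C# sketch that lists every match of a pattern in the target text. The class and method names (WildcardDemo, FindAll) and the second target string are our own illustrations, not part of the framework or of the original listings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Returns the text of every match of pattern found in target.
    public static List<string> FindAll(string target, string pattern)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(target, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string target = "The purple lathe was in the tree";

        // Each unescaped period matches any single character,
        // so the uppercase T and the spaces are part of the matches.
        foreach (string s in FindAll(target, @".he.."))
        {
            Console.WriteLine("[" + s + "]");
        }

        // An escaped period matches only a literal period.
        // "Soothe. Now" is a made-up target that contains one.
        foreach (string s in FindAll("Soothe. Now", @".he\.."))
        {
            Console.WriteLine("[" + s + "]");
        }
    }
}
```

The brackets in the output simply make the matched spaces visible.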
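Positional Characters
Pattern: ^.he..
Target: The purple lathe was in the tree
The caret metacharacter anchors the pattern to the beginning of the target text and limits the match to the text at the very start of the target. The $ and \b metacharacters work in a similar manner: $ anchors the pattern to the end of the target text, and \b anchors it to a word boundary.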
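The effect of the positional characters is easiest to see by counting matches with and without them. The following is a small sketch; AnchorDemo and CountMatches are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Counts how many times pattern matches in target.
    public static int CountMatches(string target, string pattern)
    {
        return Regex.Matches(target, pattern).Count;
    }

    public static void Main()
    {
        string target = "The purple lathe was in the tree";

        // Unanchored: matches wherever the pattern fits.
        Console.WriteLine(CountMatches(target, @".he.."));

        // ^ restricts the match to the start of the target.
        Console.WriteLine(CountMatches(target, @"^.he.."));

        // \b restricts the match to whole words, so the "the" inside "lathe" is skipped.
        Console.WriteLine(CountMatches(target, @"\bthe\b"));

        // $ restricts the match to the end of the target.
        Console.WriteLine(CountMatches(target, @"tree$"));
    }
}
```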
Character Classes
Pattern: [Tt]he
Target: The purple lathe was in the tree
Notice how The matched this time because the [Tt] character class includes both an upper- and lowercase T. Additionally, the caret (^) symbol can be used to negate the meaning of the character class (when used inside [..]). For instance, [^Tt]he would match any single character other than an upper- or lowercase t, followed by he. Note that the preceding character becomes part of the match.
Most of the commonly used character classes have an escaped shortcut notation to facilitate their use. For instance, \w matches any word character and, for ASCII text, is equivalent to [a-zA-Z_0-9]. For more details on character class shorthand notations, see Table 2.
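A short sketch of character classes in action follows. The helper name (CharClassDemo.FindAll) and the extra target strings are assumptions made for the example, chosen so that the negated class has something to match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Returns the text of every match of pattern found in target.
    public static List<string> FindAll(string target, string pattern)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(target, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // [Tt] accepts either case of the letter t.
        Console.WriteLine(FindAll("The purple lathe was in the tree", "[Tt]he").Count);

        // [^Tt] accepts any character EXCEPT T or t; that character is part of the match.
        foreach (string s in FindAll("She bathed the hen", "[^Tt]he"))
        {
            Console.WriteLine("[" + s + "]");
        }

        // \w+ matches runs of word characters (letters, digits, underscore).
        foreach (string s in FindAll("a_1 b-2", @"\w+"))
        {
            Console.WriteLine("[" + s + "]");
        }
    }
}
```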
Grouping and Alternation
Pattern: (purp)\w
Target: The purple lathe purports to turn
Alternation allows for a choice between two or more combinations of literals and metacharacters and is implemented by using the pipe (|) symbol. The pipe equates to a logical or and is applied as follows:
Pattern: (purp|lat)\w
Target: The purple lathe purports to turn
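The two patterns above can be compared side by side with a sketch like the one below. AlternationDemo and FindAll are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns the text of every match of pattern found in target.
    public static List<string> FindAll(string target, string pattern)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(target, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string target = "The purple lathe purports to turn";

        // The group (purp) followed by one word character.
        Console.WriteLine(string.Join(", ", FindAll(target, @"(purp)\w").ToArray()));

        // The pipe adds "lat" as an alternative, so "lath" now matches too.
        Console.WriteLine(string.Join(", ", FindAll(target, @"(purp|lat)\w").ToArray()));
    }
}
```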
Quantifiers
The *, + and ? quantifiers are termed abstract quantifiers: * means zero or more occurrences of the preceding element, + means one or more, and ? means zero or one. There is another class of quantifiers called numeric quantifiers. These use curly braces to specify more exacting patterns. They come in three flavors: {n}, {n,}, and {n,x}, with n and x being integers. The first of these specifies that the element or subexpression occurs exactly n times. The second means at least n times, and the third at least n times but no more than x times.
Pattern: (purp).{5}
Target: The purple lathe purports to turn
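Here is a brief sketch of the quantifiers at work. Apart from the (purp).{5} example, the patterns and targets are made up for illustration, as is the FirstMatch helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string target, string pattern)
    {
        Match m = Regex.Match(target, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // {5}: the group followed by exactly five characters of any kind.
        Console.WriteLine("[" + FirstMatch("The purple lathe purports to turn", "(purp).{5}") + "]");

        // ?: the u is optional, so both spellings match.
        Console.WriteLine(FirstMatch("color", "colou?r"));
        Console.WriteLine(FirstMatch("colour", "colou?r"));

        // {2,3}: at least two but no more than three; quantifiers take as many as allowed.
        Console.WriteLine(FirstMatch("caaaat", "a{2,3}"));

        // {3,}: three or more digits, so the two-digit number is skipped.
        Console.WriteLine(FirstMatch("id 12 and 12345", @"\d{3,}"));
    }
}
```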
Regular Expressions in the .NET Framework
Certain behaviors and characteristics of the .NET regular expression engine can be tweaked by specifying regular expression options at object construction time or at the time of method invocation. These options are defined in the RegexOptions enumeration and can be combined using a bitwise OR of RegexOptions values. A list of these options is shown in Table 1.
The reference documentation lists ten classes in the System.Text.RegularExpressions namespace, but for our purposes we will focus on the three core objects: Regex, Match, and Group. The noncore classes are either collection-based classes of core objects (as is the case with MatchCollection and GroupCollection), reserved for .NET Framework internals code, or beyond the scope of this article.
All code samples will be illustrated using the C# language and assume that the System.Text.RegularExpressions namespace has been imported via the using statement to allow for shorthand notation of the regex types. That said, let's take a look at the core .NET regular expression objects!
Regex Object
Regex re = new Regex("[Tt]ruth");
This statement will initialize and compile the regular expression, which can then be used to test for the existence of a match, capture the matched text, replace substrings within the target string, or split the text into a string array based on a regular expression delimiter.
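The Regex constructor also accepts the RegexOptions values mentioned earlier. As a minimal sketch, the following combines two of them with a bitwise OR; the pattern, the sample text, and the names OptionsDemo and CountLeadingThe are our own choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Counts matches of "^the" in text under the given options.
    public static int CountLeadingThe(string text, RegexOptions options)
    {
        Regex re = new Regex("^the", options);
        return re.Matches(text).Count;
    }

    public static void Main()
    {
        string text = "The cat\nthe dog\nat the end";

        // No options: case-sensitive, and ^ matches only at the start of the text.
        Console.WriteLine(CountLeadingThe(text, RegexOptions.None));

        // IgnoreCase lets "the" match "The".
        Console.WriteLine(CountLeadingThe(text, RegexOptions.IgnoreCase));

        // Multiline additionally lets ^ match at the start of every line.
        Console.WriteLine(CountLeadingThe(text, RegexOptions.IgnoreCase | RegexOptions.Multiline));
    }
}
```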
Simple Matching
if (re.IsMatch("We hold these Truths to be self-evident...")) { /* match found */ }
The Regex object also contains a handful of static or convenience methods for accomplishing many regex-related tasks without explicitly instantiating a Regex object. The statement above can be rewritten as follows:
if (Regex.IsMatch("We hold these Truths to be self-evident...", "[Tt]ruth")) { /* match found */ }
This code is somewhat easier to read and more succinct. Note that the static methods take the target string first and the pattern second. For a complete list of IsMatch() overloaded methods, refer to the .NET Framework documentation.
It is worth noting as we move along that many of the overloaded instance methods for the core .NET regex objects take Int32 arguments. The general rule is that when only one Int32 argument is present it represents an offset into the target string that determines where the search begins. The default is at the beginning of the target string, or position zero. For example, new Regex("ak").IsMatch("Take me out to the ballgame", 10) would return false, since only the text from position 10 onward (t to the ballgame) is searched.
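The three ways of calling IsMatch() discussed above can be sketched as follows. The wrapper class and method names are hypothetical; the Regex calls themselves are the instance IsMatch(string), the static IsMatch(string, string), and the instance IsMatch(string, int) overloads.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsMatchDemo
{
    // Instance form: construct the Regex once, then test a target.
    public static bool InstanceMatch(string target)
    {
        Regex re = new Regex("[Tt]ruth");
        return re.IsMatch(target);
    }

    // Static convenience form: target first, pattern second.
    public static bool StaticMatch(string target)
    {
        return Regex.IsMatch(target, "[Tt]ruth");
    }

    // Offset form: the search begins at startAt rather than position zero.
    public static bool MatchFrom(string target, string pattern, int startAt)
    {
        return new Regex(pattern).IsMatch(target, startAt);
    }

    public static void Main()
    {
        string target = "We hold these Truths to be self-evident...";
        Console.WriteLine(InstanceMatch(target));
        Console.WriteLine(StaticMatch(target));

        // "ak" occurs at position 1, so it is found from 0 but not from 10.
        Console.WriteLine(MatchFrom("Take me out to the ballgame", "ak", 0));
        Console.WriteLine(MatchFrom("Take me out to the ballgame", "ak", 10));
    }
}
```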
Capturing Matched Text
Regex re = new Regex(@"\d{2}[-/]\d{2}[-/]\d{2,4}");
This excerpt of code would result in Meeting date: 02-23-2003 being displayed in the console window. The Match method of the Regex object returns a Match object that in turn exposes a Value property of type string.
The regex used here is worth taking a closer look at. Using a combination of character class shorthand expressions (\d), numeric quantifiers ({2}, {2,4}), as well as custom character classes ([-/]), this expression would have matched on dates in the form of: 02-23-2003, 02/23/2003, 02-23-03, and 02/23/03. This is where the real power of regular expressions becomes evident - when dealing with subtle differences in patterns that are for our purposes semantically the same.
One other item of note is the use of the C# verbatim string literal @. This allows us to place regex escape sequences such as \d within our regex definitions without the compiler rejecting them as unrecognized string escapes. VB.NET does not require the use of the @, and escape sequences such as \d or \s can be placed directly within a string definition with no ill effects. The @ literal will be used throughout many of the examples in this article.
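A complete, runnable version of this idea might look like the sketch below. The target sentence and the FindDate helper are assumptions made for the example; only the pattern comes from the text above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Returns the first date-like substring in text, or an empty string if none is found.
    public static string FindDate(string text)
    {
        Regex re = new Regex(@"\d{2}[-/]\d{2}[-/]\d{2,4}");
        Match m = re.Match(text);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical target text containing a date.
        string text = "The next meeting is scheduled for 02-23-2003 at noon.";
        Console.WriteLine("Meeting date: " + FindDate(text));

        // The same pattern handles the other separators and year lengths.
        Console.WriteLine(FindDate("due 02/23/03"));
    }
}
```

Checking the Success property before reading Value makes the "no match" case explicit.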
String Replacement
string oldstring = @"Beginning balance :
This would result in Beginning balance : $12.00, Ending Balance $6.00 being written to the console window. The ability to use a regular expression here allows us to abstract our replacement in two very important ways: first, we are only interested in side-by-side numeric characters, and second, only those that follow a decimal point. This task could have been accomplished via manual parsing and procedural techniques but certainly not with one line of code! For a complete list of Replace() overloaded methods, refer to the .NET Framework documentation.
There is one overload of the Replace() instance method that takes two Int32 arguments. The second of these is the offset parameter discussed earlier. The first Int32 argument in this instance represents a count, that is, the maximum number of replacements to make. For example, re.Replace(x, y, 5, 10) would replace matches in x with string y a maximum of 5 times, starting at position 10. Similar overload method signatures exist for most core .NET regex objects.
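As a sketch of the replacement described above, the following zeroes out the two digits after each decimal point. The input amounts ($12.47 and $6.13), the pattern, and the helper names are all assumptions chosen to produce the output quoted in the text; the original listing may have differed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Replaces every ".dd" (decimal point plus two digits) with ".00".
    public static string ZeroCents(string text)
    {
        return Regex.Replace(text, @"\.\d\d", ".00");
    }

    // Same replacement, but at most count times and only from position startAt onward.
    public static string ZeroCentsLimited(string text, int count, int startAt)
    {
        Regex re = new Regex(@"\.\d\d");
        return re.Replace(text, ".00", count, startAt);
    }

    public static void Main()
    {
        // Hypothetical starting values.
        string oldString = "Beginning balance : $12.47, Ending Balance $6.13";

        Console.WriteLine(ZeroCents(oldString));

        // Only the first match is replaced.
        Console.WriteLine(ZeroCentsLimited(oldString, 1, 0));

        // The search starts after the first amount, so only the second is replaced.
        Console.WriteLine(ZeroCentsLimited(oldString, 5, 26));
    }
}
```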
String Splitting
string[] parts = Regex.Split("860 555-4321", @"\.|\s+|-");
The key here is to once again point out how the ability to use a regular expression to match on the delimiter allows us to do so in a much more flexible and succinct way. The expression \.|\s+|- will parse the following phone numbers in the exact same manner:
860.555.4321
Once again, for a complete list of Split() overloaded methods, refer to the .NET Framework documentation.
Match Object
The Match object represents a single match for a given regular expression. What this really means becomes clearer if we take a look at a previous example:
Pattern: (purp)\w
Target: The purple lathe purports to turn
Applying the regular expression (purp)\w to the target text results in two distinct matches. In .NET, each one of these is defined by a Match object. Match objects are immutable and have no public constructors. There are several means of getting a reference to an individual Match object: the Regex.Match() static method, the reObj.Match() instance method, the matchObj.NextMatch() instance method, or by traversing the Matches collection (MatchCollection) returned by the Regex.Matches() static or reObj.Matches() instance methods.
The Match object exposes some very useful properties that allow the programmer to go beyond simple data validation techniques or string replacements. Table 3 shows the possibilities.
Listing 1 shows an example of the use of the Match object. We simply used the static Matches method of the Regex object to return a Matches collection. Using the foreach construct, we then loop through the individual Match objects and display their properties. .NET's ability to capture each match and expose it as an easy-to-use object opens up many possibilities for custom text processing. There is one important property of the Match object that we have not discussed yet, and that is the Groups property, covered in the next section.
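For readers without the listing at hand, here is a minimal sketch in the same spirit (not the original Listing 1). It walks the matches two ways: with foreach over the MatchCollection and with NextMatch(). The class and method names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchDemo
{
    // Describes each match using the Value, Index, and Length properties.
    public static List<string> Describe(string target, string pattern)
    {
        List<string> lines = new List<string>();
        foreach (Match m in Regex.Matches(target, pattern))
        {
            lines.Add(m.Value + " at " + m.Index + ", length " + m.Length);
        }
        return lines;
    }

    // Counts matches by following NextMatch() until Success is false.
    public static int CountWithNextMatch(string target, string pattern)
    {
        int count = 0;
        Match m = Regex.Match(target, pattern);
        while (m.Success)
        {
            count++;
            m = m.NextMatch();
        }
        return count;
    }

    public static void Main()
    {
        string target = "The purple lathe purports to turn";
        foreach (string line in Describe(target, @"(purp)\w"))
        {
            Console.WriteLine(line);
        }
        Console.WriteLine(CountWithNextMatch(target, @"(purp)\w"));
    }
}
```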
Group Object
Pattern: (\d\d?)-(\d\d?)-(\d{2,4})
Target: Kaitlyn's birthday is 11-14-1998 and Ryan's is 09-11-2001.
As we would expect, the expression matches on both dates. Each of these is represented by a discrete Match object. Each of these Match objects also exposes a Groups property, which is a collection of Group objects. The Group object exposes the same properties as the Match object: Success, Value, Length, and Index (in fact, the Match object is derived from the Group object!). In this case, the regex (\d\d?)-(\d\d?)-(\d{2,4}) defines three capturing groups (parenthetical expressions). The first Match object would have a Groups collection that looks like:
match1.Groups[0].Value equals 11-14-1998
match1.Groups[1].Value equals 11
match1.Groups[2].Value equals 14
match1.Groups[3].Value equals 1998
The first thing to note is Groups[0]. Groups[0] is the matching text in its entirety and is logically equivalent to match.Value. It is Groups[1] through Groups[n] that are of the most interest to us. These allow us to gain access to submatches within our matches simply by indexing into the Groups collection using the ordinal position of the capturing group in the regex. In this example the use of capturing groups not only allows us to extract dates from a string but also to easily parse the month, day, and year! What could be easier?
Well, actually, there is a technique that can be employed to make programming and working with groups somewhat easier and less brittle than using captured group ordinal positions. We can use something called named groups. Named groups are implemented by decorating the capturing groups with a name! The most common way to do this is with the following construct: ?<name>. So, if we modify our previous regex to use named groups, it would look like:
(?<month>\d\d?)-(?<day>\d\d?)-(?<year>\d{2,4})
What this allows us to do is to index into the Groups collection using the group name rather than ordinal position, as follows:
match1.Groups["month"].Value equals 11
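Putting the named groups to work, the sketch below pulls the month, day, and year out of every date in the target and rearranges them. GroupDemo and ToYearFirst are hypothetical names, and the year-first output format is just one way to show that the pieces are individually accessible.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns each date found in text, rewritten as year-month-day.
    public static List<string> ToYearFirst(string text)
    {
        Regex re = new Regex(@"(?<month>\d\d?)-(?<day>\d\d?)-(?<year>\d{2,4})");
        List<string> dates = new List<string>();
        foreach (Match m in re.Matches(text))
        {
            // Index into the Groups collection by name rather than ordinal position.
            string month = m.Groups["month"].Value;
            string day = m.Groups["day"].Value;
            string year = m.Groups["year"].Value;
            dates.Add(year + "-" + month + "-" + day);
        }
        return dates;
    }

    public static void Main()
    {
        string text = "Kaitlyn's birthday is 11-14-1998 and Ryan's is 09-11-2001.";
        foreach (string d in ToYearFirst(text))
        {
            Console.WriteLine(d);
        }

        // Groups[0] is still the whole match, named groups or not.
        Match first = Regex.Match(text, @"(?<month>\d\d?)-(?<day>\d\d?)-(?<year>\d{2,4})");
        Console.WriteLine(first.Groups[0].Value);
    }
}
```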
Advanced .NET: The MatchEvaluator Delegate
Say, for instance, we have a document that has fourth-quarter sales results for our company. All figures are in U.S. dollars, but we need to have our Moscow office look them over before presenting to shareholders. In order to facilitate this, we will convert U.S. dollars to Russian rubles in all of our documents. We can do this using the code shown in Listing 2. Executing this code would result in newdoc being set equal to ... 438,522 rubles for services and 566,574 rubles in product ....
The only requirement placed on our custom method is that its signature match the MatchEvaluator delegate definition. This means that it must take a Match object as an argument and return a string. Beyond this, we can implement the delegate however we choose. MatchEvaluators certainly open up a whole host of possibilities when it comes to replacing strings within strings. The Match object gives us access to the Groups collection, as illustrated in the example. So, go now and Replace() with abandon!
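A minimal sketch of the technique follows; it is not the original Listing 2. The exchange rate of 30 rubles per dollar, the dollar amounts, the pattern, and the names RubleConverter, ToRubles, and ConvertDocument are all assumptions chosen for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RubleConverter
{
    // Hypothetical exchange rate, for illustration only.
    private const decimal RublesPerDollar = 30m;

    // Matches the MatchEvaluator signature: takes a Match, returns a string.
    private static string ToRubles(Match m)
    {
        // The named group holds the digits without the dollar sign.
        decimal dollars = decimal.Parse(
            m.Groups["amount"].Value, NumberStyles.Number, CultureInfo.InvariantCulture);
        decimal rubles = Math.Round(dollars * RublesPerDollar);
        return rubles.ToString("N0", CultureInfo.InvariantCulture) + " rubles";
    }

    // Replaces every dollar figure in doc with its ruble equivalent.
    public static string ConvertDocument(string doc)
    {
        string pattern = @"\$(?<amount>\d{1,3}(,\d{3})*(\.\d+)?)";
        return Regex.Replace(doc, pattern, new MatchEvaluator(ToRubles));
    }

    public static void Main()
    {
        // Hypothetical document text.
        string doc = "We billed $14,000 for services and $18,000 in product.";
        string newdoc = ConvertDocument(doc);
        Console.WriteLine(newdoc);
    }
}
```

Because the evaluator runs once per match, any per-match logic (lookups, arithmetic, formatting) can live in that one method.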
Conclusion
The types in the System.Text.RegularExpressions namespace are full-fledged members of the .NET Framework class library - not adjuncts or an afterthought. Hopefully, .NET regular expressions will help to make the arcane text parsing tasks possible and the mundane ones more fun!