How to add comments to a regular expression

Posted by Jan Goyvaerts, Sep 16 2009 10:04 AM

\d{4}-\d{2}-\d{2} matches a date in yyyy-mm-dd format, without doing any validation of the numbers. Such a simple regular expression is appropriate when you know your data does not contain any invalid dates. Add comments to this regular expression to indicate what each part of the regular expression does.

\d{4}    # Year
-        # Separator
\d{2}    # Month
-        # Separator
\d{2}    # Day

Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Free-spacing mode

Regular expressions can quickly become complicated and difficult to understand. Just as you should comment source code, you should comment all but the most trivial regular expressions.

Most regular expression flavors, with the exception of JavaScript, offer an alternative regular expression syntax that makes it easy to comment your regular expressions clearly. You enable this syntax by turning on the free-spacing option, which goes by different names in different programming languages.

In .NET, set the RegexOptions.IgnorePatternWhitespace option. In Java, pass the Pattern.COMMENTS flag. Python expects re.VERBOSE. PHP, Perl, and Ruby use the /x flag.
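As an illustration of the Python case, here is a minimal sketch that compiles the commented date regex with re.VERBOSE. The names DATE and is_date are made up for this example and are not part of the recipe.

```python
import re

# The free-spacing date regex from the recipe, enabled with re.VERBOSE.
# DATE and is_date are hypothetical names chosen for this sketch.
DATE = re.compile(r"""
    \d{4}   # Year
    -       # Separator
    \d{2}   # Month
    -       # Separator
    \d{2}   # Day
""", re.VERBOSE)


def is_date(text):
    """Return True if text looks like yyyy-mm-dd (no range validation)."""
    return DATE.fullmatch(text) is not None
```

Because the whitespace and comments are ignored, this should behave the same as the compact one-line pattern.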

Turning on free-spacing mode has two effects. First, it turns the hash symbol (#) into a metacharacter outside character classes. The hash starts a comment that runs until the end of the line or the end of the regex (whichever comes first). The hash and everything after it are ignored by the regular expression engine. To match a literal hash sign, either place it inside a character class [#] or escape it \#.

The other effect is that whitespace, which includes spaces, tabs, and line breaks, is also ignored outside character classes. To match a literal space, either place it inside a character class [ ] or escape it with a backslash (\ ). If you’re concerned about readability, you could use the hexadecimal escape \x20 or the Unicode escape \u0020 or \x{0020} instead. To match a tab, use \t. For line breaks, use \r\n (Windows) or \n (Unix/Linux/OS X).
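The two effects can be seen side by side in a short Python sketch. The "issue #42" text and the names ISSUE and UNPROTECTED are invented for this example.

```python
import re

# Every literal space and hash is protected, so this matches "issue #42".
# ISSUE and UNPROTECTED are hypothetical names for this sketch.
ISSUE = re.compile(r"""
    issue   # the word "issue"
    [ ]     # a literal space, inside a character class
    \#      # a literal hash, escaped
    \d+     # the issue number
""", re.VERBOSE)

# Here the space is not protected, so free-spacing mode drops it
# and the pattern behaves like issue\d+.
UNPROTECTED = re.compile(r"issue \d+", re.VERBOSE)
```

With these definitions, ISSUE should accept "issue #42", while UNPROTECTED should accept "issue42" but not "issue 42".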

Free-spacing mode does not change anything inside character classes. A character class is a single token. Any whitespace characters or hashes inside character classes are literal characters that are added to the character class. You cannot break up character classes to comment their parts.
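A small Python sketch of this behavior, with a made-up pattern name: the space and the hash inside the brackets stay literal even though free-spacing mode is on, while the comment after the class is still ignored.

```python
import re

# Inside the character class, the space and the hash are ordinary members.
# CLASS_DEMO is a hypothetical name for this sketch.
CLASS_DEMO = re.compile(r"""
    [a-c #]+   # letters a to c, a space, or a hash, one or more times
""", re.VERBOSE)
```

Note that this applies to Python and the other flavors that treat a character class as a single token.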


Java has free-spacing character classes

Regular expressions wouldn’t live up to their reputation unless at least one flavor was incompatible with the others. In this case, Java is the odd one out.

In Java, character classes are not parsed as single tokens. If you turn on free-spacing mode, Java ignores whitespace in character classes, and hashes inside character classes do start comments. This means you cannot use [ ] and [#] to match these characters literally. Use \u0020 and \# instead.

Variations


(?#Year)\d{4}(?#Separator)-(?#Month)\d{2}-(?#Day)\d{2}

Regex options: None
Regex flavors: .NET, PCRE, Perl, Python, Ruby

If, for some reason, you can’t or don’t want to use free-spacing syntax, you can still add comments by way of (?#comment). All characters between (?# and ) are ignored.

Unfortunately, JavaScript, the flavor that doesn’t support free-spacing, also doesn’t support this comment syntax. Java does not support it either.
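In Python, one of the flavors listed for this variation, the inline comments can be used without any flags. A minimal sketch, with DATE_INLINE as a made-up name:

```python
import re

# No flags are passed: the (?#...) groups are comments, not part of the match.
# DATE_INLINE is a hypothetical name for this sketch.
DATE_INLINE = re.compile(
    r"(?#Year)\d{4}(?#Separator)-(?#Month)\d{2}-(?#Day)\d{2}"
)
```

This pattern should match the same strings as the uncommented \d{4}-\d{2}-\d{2}.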

(?x)\d{4}    # Year
-            # Separator
\d{2}        # Month
-            # Separator
\d{2}        # Day

Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

If you cannot turn on free-spacing mode outside the regular expression, you can place the mode modifier (?x) at the very start of the regular expression. Make sure there’s no whitespace before the (?x). Free-spacing mode begins only at this mode modifier; any whitespace before it is significant.
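A Python sketch of the mode modifier, assuming the pattern is passed as a plain string with no flags argument. DATE_X is a made-up name, and (?x) is placed as the very first characters of the string.

```python
import re

# Free-spacing mode is switched on by (?x) inside the pattern itself,
# so re.compile receives no flags. DATE_X is a hypothetical name.
DATE_X = re.compile(r"""(?x)\d{4}    # Year
-            # Separator
\d{2}        # Month
-            # Separator
\d{2}        # Day
""")
```

This is handy when the regex comes from a configuration file or user input and the calling code gives you no way to pass options.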

Regular Expressions Cookbook

Learn more about this topic from Regular Expressions Cookbook.

This cookbook provides more than 100 recipes to help you crunch data and manipulate text with regular expressions. With recipes for popular programming languages such as C#, Java, JavaScript, Perl, PHP, Python, Ruby, and VB.NET, Regular Expressions Cookbook will help you learn powerful new tricks, avoid language-specific gotchas, and save valuable time with this library of proven solutions to difficult, real-world problems.
