\d{4}-\d{2}-\d{2} matches a date in yyyy-mm-dd
format, without doing any validation of the numbers. Such a
simple regular expression is appropriate when you know your data does
not contain any invalid dates. Add comments to this regular expression
to indicate what each part of it does.
\d{4} # Year
- # Separator
\d{2} # Month
- # Separator
\d{2} # Day
| Regex options: Free-spacing |
| Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
Regular expressions can quickly become complicated and difficult to understand. Just as you should comment source code, you should comment all but the most trivial regular expressions.
Most regular expression flavors, with the exception of JavaScript, offer an alternative regular expression syntax that makes it easy to comment your regular expressions clearly. You can enable this syntax by turning on the free-spacing option, which goes by different names in different programming languages.
In .NET, set the RegexOptions.IgnorePatternWhitespace option.
In Java, pass the Pattern.COMMENTS
flag. Python expects re.VERBOSE. PHP, Perl, and Ruby use the
/x flag.
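As a minimal sketch of the Python case, the commented date regex can be compiled with re.VERBOSE. The names DATE_RE and find_dates are illustrative only and are not part of the recipe.

```python
import re

# The commented date regex from above, compiled in free-spacing mode.
# Whitespace and "# ..." comments inside the pattern are ignored.
DATE_RE = re.compile(r"""
    \d{4}  # Year
    -      # Separator
    \d{2}  # Month
    -      # Separator
    \d{2}  # Day
""", re.VERBOSE)


def find_dates(text):
    """Return every yyyy-mm-dd shaped substring; the numbers are not validated."""
    return DATE_RE.findall(text)
```

Because the regex does no validation, find_dates also returns strings such as 9999-99-99.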
Turning on free-spacing mode has two effects. It turns the hash symbol (#) into a metacharacter, outside character
classes. The hash starts a comment that runs until the end of the
line or the end of the regex (whichever comes first). The hash and
everything after it is simply ignored by the regular expression
engine. To match a literal hash sign, either place it inside a
character class [#] or escape it \#.
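The sketch below illustrates both ways of matching a literal hash in Python's free-spacing mode, along with what happens when the hash is left unescaped. The hashtag-style sample text and the variable names are assumptions made for the example.

```python
import re

# A literal hash written inside a character class.
TAG_IN_CLASS = re.compile(r"""
    [#]   # literal hash sign
    \w+   # tag name
""", re.VERBOSE)

# A literal hash written with a backslash escape.
TAG_ESCAPED = re.compile(r"""
    \#    # literal hash sign
    \w+   # tag name
""", re.VERBOSE)

# An unescaped hash starts a comment, so this whole pattern is empty.
TAG_UNESCAPED = re.compile(r"#\w+", re.VERBOSE)

SAMPLE = "see #regex and #python"
```

Here TAG_UNESCAPED matches only the empty string, because everything from the hash onward is treated as a comment.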
The other effect is that whitespace, which includes spaces, tabs, and line breaks, is
also ignored outside
character classes. To match a literal space, either
place it inside a character
class [●] or escape it \●. If you’re concerned about
readability, you could use the hexadecimal escape \x20 or the Unicode escape
\u0020 or \x{0020} instead. To match a
tab, use \t. For
line breaks, use \r\n (Windows) or \n (Unix/Linux/OS X).
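A short Python sketch of the whitespace rules follows. It uses only the forms that Python's re module accepts (a character class, a backslash-escaped space, \x20, and \t), and the pattern names are hypothetical.

```python
import re

# In free-spacing mode the bare space is dropped, so this matches "helloworld".
SPACE_IGNORED = re.compile(r"hello world", re.VERBOSE)

# Three ways to require a literal space between the two words.
SPACE_IN_CLASS = re.compile(r"hello [ ] world", re.VERBOSE)
SPACE_ESCAPED = re.compile(r"hello \  world", re.VERBOSE)
SPACE_HEX = re.compile(r"hello \x20 world", re.VERBOSE)

# A literal tab is written as \t.
TAB_SEPARATED = re.compile(r"key \t value", re.VERBOSE)
```

The extra spaces around each token are there only for readability; the engine discards them.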
Free-spacing mode does not change anything inside character classes. A character class is a single token. Any whitespace characters or hashes inside character classes are literal characters that are added to the character class. You cannot break up character classes to comment their parts.
Regular expressions wouldn’t live up to their reputation unless at least one flavor was incompatible with the others. In this case, Java is the odd one out.
In Java, character classes are not parsed as single tokens. If
you turn on free-spacing mode, Java ignores whitespace in character
classes, and hashes inside character classes do start comments. This
means you cannot use [●] and [#] to match these characters
literally. Use \u0020 and \# instead.
(?#Year)\d{4}(?#Separator)-(?#Month)\d{2}-(?#Day)\d{2}
| Regex options: None |
| Regex flavors: .NET, PCRE, Perl, Python, Ruby |
If, for some reason, you can’t or don’t want to use free-spacing
syntax, you can still add comments by way of (?#comment). All characters between (?# and ) are ignored.
Unfortunately, JavaScript, the flavor that doesn’t support free-spacing, doesn’t support this comment syntax either. Neither does Java.
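Python is one of the flavors that accepts (?#comment), so the inline-commented regex can be compiled without any flags. This is a minimal sketch; the name INLINE_DATE_RE is an assumption made for illustration.

```python
import re

# No re.VERBOSE flag: each (?#...) group is a comment the engine skips.
INLINE_DATE_RE = re.compile(
    r"(?#Year)\d{4}(?#Separator)-(?#Month)\d{2}-(?#Day)\d{2}"
)


def is_date_shaped(text):
    """True if the whole string has the yyyy-mm-dd shape (numbers not validated)."""
    return INLINE_DATE_RE.fullmatch(text) is not None
```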
(?x)\d{4} # Year
- # Separator
\d{2} # Month
- # Separator
\d{2} # Day
| Regex options: None |
| Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
If you cannot turn on free-spacing mode outside the regular
expression, you can place the mode modifier (?x) at the very start of the regular
expression. Make sure there’s no whitespace before the (?x). Free-spacing mode begins
only at this mode modifier; any whitespace before it is
significant.
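The same idea in Python, as a sketch: the (?x) modifier sits at the very start of the pattern string, and no flag is passed to re.compile. The name MODIFIER_DATE_RE is hypothetical.

```python
import re

# Free-spacing mode is switched on by (?x) inside the pattern itself,
# so re.compile receives no flags argument.
MODIFIER_DATE_RE = re.compile(r"""(?x)\d{4} # Year
    -     # Separator
    \d{2} # Month
    -     # Separator
    \d{2} # Day
""")
```

This form is useful when the regex is read from a configuration file or typed into a tool that offers no way to set options separately.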