If you want to instantiate a regular expression object or otherwise compile a regular expression so you can use it efficiently throughout your application, try one of the following examples:
In C#, if you know the regex to be correct:
Regex regexObj = new Regex("regex pattern");
If the regex is provided by the end user (UserInput being a string variable):
try {
    Regex regexObj = new Regex(UserInput);
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
In VB.NET, if you know the regex to be correct:
Dim RegexObj As New Regex("regex pattern")
If the regex is provided by the end user (UserInput being a string variable):
Try
    Dim RegexObj As New Regex(UserInput)
Catch ex As ArgumentException
    'Syntax error in the regular expression
End Try
In Java, if you know the regex to be correct:
Pattern regex = Pattern.compile("regex pattern");
If the regex is provided by the end user (userInput being a string variable):
try {
    Pattern regex = Pattern.compile(userInput);
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
To be able to use the regex on a string, create a Matcher:
Matcher regexMatcher = regex.matcher(subjectString);
To use the regex on another string, you can create a new Matcher, as just shown, or reuse an existing one:
regexMatcher.reset(anotherSubjectString);
In JavaScript, a literal regular expression in your code:
var myregexp = /regex pattern/;
Regular expression retrieved from user input, as a string stored in the variable userinput:
var myregexp = new RegExp(userinput);
In Perl:
$myregex = qr/regex pattern/;
Regular expression retrieved from user input, as a string stored in the variable $userinput:
$myregex = qr/$userinput/;
In Python:
reobj = re.compile("regex pattern")
Regular expression retrieved from user input, as a string stored in the variable userinput:
reobj = re.compile(userinput)
Before the regular expression engine can match a regular expression to a string, the regular expression has to be compiled. This compilation happens while your application is running. The regular expression constructor or compile function parses the string that holds your regular expression and converts it into a tree structure or state machine. The function that does the actual pattern matching will traverse this tree or state machine as it scans the string. Programming languages that support literal regular expressions do the compilation when execution reaches the regular expression operator.
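As a minimal Python sketch of this point, assume the pattern arrives as a string while the program is running. The helper name try_compile is hypothetical; re.compile and re.error are the standard-library names. A syntax error in the pattern surfaces only when the compile call executes, not when the program is loaded.

```python
import re

def try_compile(pattern_text):
    """Return (compiled_pattern, None) on success or (None, message) on failure."""
    try:
        return re.compile(pattern_text), None
    except re.error as exc:
        # The pattern string is parsed only now, at run time.
        return None, str(exc)

# A valid pattern compiles into a reusable object.
compiled, problem = try_compile(r"\d+")

# An unbalanced parenthesis is reported when compile() runs.
broken, message = try_compile("(unclosed")
```

The returned message could be shown to the user along with a request to correct the pattern.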
In C# and VB.NET, the .NET class System.Text.RegularExpressions.Regex holds one
compiled regular expression. The simplest constructor takes just one parameter: a
string that holds your regular expression.
If there’s a syntax error in the regular expression, the
Regex() constructor
will throw an ArgumentException. The exception message will indicate exactly which
error was encountered. It is important to catch this exception if
the regular expression is provided by the user of your application.
Display the exception message and ask the user to correct the
regular expression. If your regular expression is a hardcoded string
literal, you can omit catching the exception if you use a code
coverage tool to make sure the line is executed without throwing an
exception. There are no possible changes to state or mode that could
cause the same literal regex to compile in one situation and fail to
compile in another. Note that if there is a syntax error in your
literal regex, the exception will occur when your application is
run, not when your application is compiled.
You should construct a Regex object if you will be using the regular
expression inside a loop or repeatedly throughout your application.
Constructing the regex object involves no extra overhead. The static
members of the Regex class that
take the regex as a string parameter construct a Regex object internally
anyway, so you might just as well do it in your own code and keep a
reference to the object.
If you plan to use the regex only once or a few times, you can
use the static members of the Regex class instead, to save a line of
code. The static Regex members do not throw away the internally
constructed regular expression object immediately; instead, they
keep a cache of the 15 most recently used regular expressions. You
can change the cache size by setting the Regex.CacheSize property. The cache is searched using your regular
expression string as the key. But don’t go overboard with the
cache. If you need lots of regex objects frequently, keep a cache of
your own that you can look up more efficiently than with a string
search.
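To make the idea of a most-recently-used cache concrete, here is a small sketch in Python rather than .NET. It is not how the Regex class is implemented; the class name RegexCache and the max_size parameter are hypothetical, and the default of 15 simply mirrors the figure mentioned above.

```python
import re
from collections import OrderedDict

class RegexCache:
    """Keep the most recently used compiled patterns, keyed by pattern string."""

    def __init__(self, max_size=15):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, pattern_text):
        if pattern_text in self._entries:
            # Cache hit: mark this entry as most recently used.
            self._entries.move_to_end(pattern_text)
            return self._entries[pattern_text]
        compiled = re.compile(pattern_text)
        self._entries[pattern_text] = compiled
        if len(self._entries) > self.max_size:
            # Cache full: drop the least recently used entry.
            self._entries.popitem(last=False)
        return compiled

    def __contains__(self, pattern_text):
        return pattern_text in self._entries

    def __len__(self):
        return len(self._entries)
```

Every call to get() still pays for a lookup by string, which is why holding a direct reference to a compiled object is cheaper when the same regex is used often.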
In Java, the Pattern
class holds one compiled regular expression. You can create objects
of this class with the Pattern.compile() class factory, which
requires just one parameter: a string with your regular
expression.
If there’s a syntax error in the regular expression, the
Pattern.compile()
factory will throw a PatternSyntaxException. The exception message will indicate exactly which
error was encountered. It is important to catch this exception if
the regular expression is provided by the user of your application.
Display the exception message and ask the user to correct the
regular expression. If your regular expression is a hardcoded string
literal, you can omit catching the exception if you use a code
coverage tool to make sure the line is executed without throwing an
exception. There are no possible changes to state or mode that could
cause the same literal regex to compile in one situation and fail to
compile in another. Note that if there is a syntax error in your
literal regex, the exception will occur when your application is
run, not when your application is compiled.
Unless you plan to use a regex only once, you should create a
Pattern object
instead of using the static members of the String class. Though it takes a few lines of
extra code, that code will run more efficiently. The static calls
recompile your regex each and every time. In fact, Java provides
static calls for only a few very basic regex tasks.
A Pattern
object only stores a compiled regular expression; it does not do any
actual work. The actual regex matching is done by the Matcher class. To create a Matcher, call the matcher() method on your compiled regular
expression. Pass the subject string as the only argument to matcher().
You can call matcher() as many times as you like to use the
same regular expression on multiple strings. You can work with
multiple matchers using the same regex at the same time. A compiled
Pattern is immutable and safe to share between threads, but a Matcher
is not thread-safe. If you want to use the same regex in
multiple threads, share the Pattern and call matcher() separately in
each thread.
If you’re done applying a regex to one string and want to
apply the same regex to another string, you can reuse the Matcher object by calling
reset(). Pass the next subject string as the only argument.
This is more efficient than creating a new Matcher object. reset() returns the same Matcher you called it on,
allowing you to easily reset and use a matcher in one line of code,
e.g., regexMatcher.reset(nextString).find().
In JavaScript, the notation for literal regular expressions already creates a new regular expression object. To use the same object repeatedly, simply assign it to a variable.
If you have a regular expression stored in a string variable
(e.g., because you asked the user to type in a regular expression),
use the RegExp()
constructor to compile the regular expression. Notice that the regular expression inside the string
is not delimited by forward slashes. Those slashes are part of
JavaScript’s notation for literal RegExp objects, rather than part of the
regular expression itself.
Tip
Since assigning a literal regex to a variable is trivial, most of the JavaScript solutions in this chapter omit this line of code and use the literal regular expression directly. In your own code, when using the same regex more than once, you should assign the regex to a variable and use that variable instead of pasting the same literal regex multiple times into your code. This increases performance and makes your code easier to maintain.
PHP does not provide a way to store a compiled regular expression
in a variable. Whenever you want to do something with a regular
expression, you have to pass it as a string to one of the preg functions.
The preg
functions keep a cache of up to 4,096 compiled regular expressions.
Although the hash-based cache lookup is not as fast as referencing a
variable, the performance hit is not as dramatic as having to
recompile the same regular expression over and over. When the cache
is full, the regex that was compiled the longest ago is
removed.
In Perl, you can use the “quote regex” operator, qr//, to compile a regular expression and assign it to a variable.
Perl is generally quite efficient at reusing previously
compiled regular expressions. Therefore, we don’t use qr// in the code samples here.
qr// is useful
when you’re interpolating variables in the regular expression or
when you’ve retrieved the whole regular expression as a string
(e.g., from user input). With qr/$regexstring/, you can control when the
regex is recompiled to reflect the new contents of $regexstring. m/$regexstring/ would
recompile the regex every time, whereas m/$regexstring/o never recompiles it.
The compile()
function in Python’s re module takes a string with your regular
expression, and returns an object with your compiled regular
expression.
You should call compile() explicitly if you plan to use the
same regular expression repeatedly. All the functions in the
re module first call
compile(), and then
call the function you wanted on the compiled regular expression
object.
The compile()
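function keeps a reference to the last 100 regular expressions that
it compiled. This reduces the recompilation of any of the last 100
used regular expressions to a dictionary lookup. When the cache is
full, it is cleared out entirely. (These figures describe older Python releases; the cache size and eviction policy are implementation details that have changed in later versions.)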
If performance is not an issue, the cache works well enough
that you can use the functions in the re module directly. But when performance
matters, calling compile() is a good idea.
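The two approaches can be sketched side by side. The function names count_words and count_words_uncompiled and the word-matching pattern are illustrative assumptions; only re.compile, findall, and re.findall come from the standard library.

```python
import re

# Compile once, outside any loop, and keep a reference to the object.
word_pattern = re.compile(r"\b\w+\b")

def count_words(lines):
    """Count words across lines, reusing the precompiled pattern."""
    return sum(len(word_pattern.findall(line)) for line in lines)

def count_words_uncompiled(lines):
    """Same result through the module-level function, which looks the
    pattern string up in the re module's cache on every call."""
    return sum(len(re.findall(r"\b\w+\b", line)) for line in lines)

sample = ["one two", "three"]
total = count_words(sample)
```

Both functions return the same count; the first skips the per-call cache lookup, which matters mainly inside tight loops.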
In Ruby, the notation for literal regular expressions already creates a new regular expression object. To use the same object repeatedly, simply assign it to a variable.
If you have a regular expression stored in a string variable
(e.g., because you asked the user to type in a regular expression),
use the Regexp.new()
factory or its synonym Regexp.compile() to compile the regular
expression. Notice that the regular expression inside the string is
not delimited by forward slashes. Those slashes are part of Ruby’s
notation for literal Regexp objects and are not part of the regular
expression itself.
Tip
Since assigning a literal regex to a variable is trivial, most of the Ruby solutions in this chapter omit this line of code and use the literal regular expression directly. In your own code, when using the same regex more than once, you should assign the regex to a variable and use the variable instead of pasting the same literal regex multiple times into your code. This increases performance and makes your code easier to maintain.
When you construct a Regex object in .NET without passing any
options, the regular expression is compiled in the way we described in the previous section. If you pass RegexOptions.Compiled as a second parameter to
the Regex()
constructor, the Regex
class does something rather different: it compiles your regular
expression down to CIL, also known as MSIL. CIL stands for Common Intermediate Language, a low-level
programming language that is closer to assembly than to C# or Visual
Basic. All .NET compilers produce CIL. The first time your application
runs, the .NET framework compiles the CIL further down to machine code
suitable for the user’s computer.
The benefit of compiling a regular expression with RegexOptions.Compiled is that it
can run up to 10 times faster than a regular expression compiled
without this option. The
drawback is that this compilation can be up to two orders of
magnitude slower than simply parsing the regex string into a tree. The
CIL code also becomes a permanent part of your application until it is
terminated. CIL code is not garbage collected.
Use RegexOptions.Compiled only if a regular
expression is either so complex or needs to process so much text that
the user experiences a noticeable wait during operations using the
regular expression. The compilation and assembly overhead is not worth
it for regexes that do their job in a split second.