Here are sample solutions for the various flavors:
\bcat\b
Regex options: None
The regular expression token \b is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. \b is an anchor, just like the tokens introduced in the previous section.
\b matches in these three positions:
Before the first character in the subject, if the first character is a word character
After the last character in the subject, if the last character is a word character
Between two characters in the subject, where one is a word character and the other is not a word character
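The three positions above can be checked directly. The following is a minimal sketch using Python's re module; the helper name boundary_positions is made up for this illustration and is not part of any library.

```python
import re

def boundary_positions(subject):
    # Collect every index at which the zero-length token \b matches.
    return [m.start() for m in re.finditer(r"\b", subject)]

# "cat": before the first and after the last character (both word characters).
print(boundary_positions("cat"))      # [0, 3]

# "my cat!": start of subject, and between word and nonword characters.
# No match at index 7, because the last character "!" is not a word character.
print(boundary_positions("my cat!"))  # [0, 2, 3, 6]

# No word characters at all, so no word boundaries.
print(boundary_positions("!!"))       # []
```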
None of the flavors discussed in this book has separate tokens for matching only before or only after a word. Unless you want to create a regex that consists of nothing but a word boundary, these aren’t needed. The tokens before or after the \b in your regular expression determine where \b can match. The \b in !\b can match only at the start of a word. The \b in \b! can match only at the end of a word. !\b! can never match anywhere.
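A short sketch of how a neighboring token pins down which kind of boundary \b can be. It assumes Python's re module; the helper match_starts and the sample subject are invented for the illustration.

```python
import re

def match_starts(pattern, subject):
    # Start index of every match of pattern in subject.
    return [m.start() for m in re.finditer(pattern, subject)]

subject = "wow! !wow"

# "!" then a boundary: the next character must be a word character,
# so the boundary sits at the start of a word.
print(match_starts(r"!\b", subject))   # [5]

# A boundary then "!": the previous character must be a word character,
# so the boundary sits at the end of a word.
print(match_starts(r"\b!", subject))   # [3]

# A boundary between two nonword characters cannot exist.
print(match_starts(r"!\b!", subject + "!!"))  # []
```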
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with \bcat\b. The first \b requires the c to occur at the very start of the string, or after a nonword character. The second \b requires the t to occur at the very end of the string, or before a nonword character.
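As an illustration, a “whole words only” search for cat might look like this in Python. The function name has_whole_word_cat and the sample subjects are assumptions made for this sketch.

```python
import re

whole_word_cat = re.compile(r"\bcat\b")

def has_whole_word_cat(subject):
    # True only if "cat" occurs with a word boundary on both sides.
    return whole_word_cat.search(subject) is not None

for subject in ["My cat is brown", "cat", "category", "bobcat", "staccato"]:
    print(f"{subject!r}: {has_whole_word_cat(subject)}")
# 'My cat is brown': True
# 'cat': True
# 'category': False
# 'bobcat': False
# 'staccato': False
```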
Line break characters are nonword characters. \b will match after a line break if the line break is immediately followed by a word character. It will also match before a line break immediately preceded by a word character. So a word that occupies a whole line by itself will be found by a “whole words only” search. \b is unaffected by “multiline” mode or (?m), which is one of the reasons why this book refers to “multiline” mode as “^ and $ match at line breaks” mode.
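The difference between \b and the ^ and $ anchors can be seen in a small experiment. This is a sketch in Python, where “multiline” mode is the re.MULTILINE flag; the sample subject is invented.

```python
import re

subject = "dog\ncat\nbird"

# \b finds the word on its own line with or without multiline mode.
plain = [m.start() for m in re.finditer(r"\bcat\b", subject)]
multi = [m.start() for m in re.finditer(r"\bcat\b", subject, re.MULTILINE)]
print(plain, multi)  # [4] [4]

# ^ and $ only match at the line breaks when multiline mode is on.
anchored_plain = re.findall(r"^cat$", subject)
anchored_multi = re.findall(r"^cat$", subject, re.MULTILINE)
print(anchored_plain, anchored_multi)  # [] ['cat']
```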
\B matches in these five positions:
Before the first character in the subject, if the first character is not a word character
After the last character in the subject, if the last character is not a word character
Between two word characters
Between two nonword characters
The empty string
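The positions listed above can be explored with a sketch like the following, again assuming Python's re module and a made-up helper name. The empty-subject case is deliberately left out, because Python's handling of \B on an empty string has differed between versions.

```python
import re

def nonboundary_positions(subject):
    # Collect every index at which the zero-length token \B matches.
    return [m.start() for m in re.finditer(r"\B", subject)]

# Between two word characters.
print(nonboundary_positions("cat"))      # [1, 2]

# Before a nonword first character, between two nonword characters,
# and after a nonword last character.
print(nonboundary_positions("!!"))       # [0, 1, 2]

# In a mixed subject, \B matches at the positions where \b does not.
print(nonboundary_positions("my cat!"))  # [1, 4, 5, 7]
```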
\Bcat\B matches cat in staccato, but not in My cat is brown.
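To do the opposite of a “whole words only” search (i.e., excluding My cat is brown and including bobcat), you need to use alternation to combine \Bcat and cat\B into \Bcat|cat\B. \Bcat matches cat in staccato and bobcat. cat\B would also match cat in staccato, if \Bcat hadn’t already taken care of that.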
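A sketch of the opposite of a “whole words only” search, assuming the two alternatives being combined are \Bcat and cat\B. It is written in Python, and the variable names and sample subjects are invented for the illustration.

```python
import re

inside_word = re.compile(r"\Bcat\B")     # word characters on both sides
part_of_word = re.compile(r"\Bcat|cat\B")  # a word character on at least one side

subjects = ["My cat is brown", "staccato", "bobcat", "category"]

inside = [s for s in subjects if inside_word.search(s)]
partial = [s for s in subjects if part_of_word.search(s)]

print(inside)   # ['staccato']
print(partial)  # ['staccato', 'bobcat', 'category']
```

The standalone word in My cat is brown is rejected by both patterns, because cat there has a word boundary on each side.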
Although all the flavors in this book support \b and \B, they differ in which characters are word characters. In most of them, \b matches between two characters where one is matched by \w and the other by \W, and \B always matches between two characters where both are matched by \w or both by \W. Some flavors view only ASCII characters as word characters, so \w is identical to [a-zA-Z0-9_]. With these flavors, you can do a “whole words only” search on words in languages that use only the letters A to Z without diacritics, such as English. But these flavors cannot do “whole words only” searches on words in other languages, such as Spanish or Russian.
.NET and Perl treat letters and digits from all scripts as word characters. With these flavors, you can do a “whole words only” search on words in any language, including those that don’t use the Latin alphabet.
Python gives you an option. Non-ASCII characters are included only if you pass the U flag when creating the regex. This flag affects both \b and \w.
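The effect of this choice can be sketched as follows. One assumption to note: the sketch targets Python 3, where string patterns are Unicode-aware by default and the re.ASCII flag is the opt-out, which is the reverse of the opt-in U flag described above. The sample text is invented.

```python
import re

text = "кошка cat"

# Python 3 default for str patterns: \w and \b are Unicode-aware.
default_words = re.findall(r"\b\w+\b", text)
print(default_words)  # ['кошка', 'cat']

# With re.ASCII, only [a-zA-Z0-9_] count as word characters,
# and this changes both \w and \b.
ascii_words = re.findall(r"\b\w+\b", text, re.ASCII)
print(ascii_words)    # ['cat']

# A "whole words only" search for the Russian word fails in ASCII mode,
# because \b finds no boundary next to Cyrillic letters.
unicode_hit = re.search(r"\bкошка\b", text) is not None
ascii_hit = re.search(r"\bкошка\b", text, re.ASCII) is not None
print(unicode_hit, ascii_hit)  # True False
```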
Java behaves inconsistently. \w matches only ASCII characters. But \b is Unicode-enabled, supporting any script. In Java, \b\w\b matches a single English letter, digit, or underscore that does not occur as part of a word in any language. \bкошка\b will correctly match the Russian word for cat, because \b supports Unicode. But \w+ will not match any Russian word, because \w is ASCII-only.
Learn more about this topic from Regular Expressions Cookbook.