The following are common uses for Unicode code points, properties, blocks and scripts:
Use a regular expression to find the trademark sign (™) by specifying its Unicode code point rather than copying and pasting an actual trademark sign. If you like copy and paste, the trademark sign is just another literal character, even though you cannot type it directly on your keyboard.
Create a regular expression that matches any character that has the “Currency Symbol” Unicode property. Unicode properties are also called Unicode categories.
Create a regular expression that matches any character in the “Greek Extended” Unicode block.
Create a regular expression that matches any character that, according to the Unicode standard, is part of the Greek script.
Create a regular expression that matches a grapheme, or what is commonly thought of as a character: a base character with all its combining marks.
Here are sample solutions for the various flavors:
\u2122
| Regex options: None |
| Regex flavors: .NET, Java, Javascript, Python |
This regex works in Python only when quoted as a Unicode
string: u"\u2122".
\x{2122}
| Regex options: None |
| Regex flavors: PCRE, Perl, Ruby 1.9 |
PCRE must be compiled with UTF-8 support; in PHP, turn on
UTF-8 support with the /u pattern
modifier. Ruby 1.8 does not support Unicode regular
expressions.
\p{Sc}
| Regex options: None |
| Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9 |
PCRE must be compiled with UTF-8 support; in PHP, turn on
UTF-8 support with the /u pattern
modifier. Javascript and Python do not support Unicode properties.
Ruby 1.8 does not support Unicode regular expressions.
\p{IsGreekExtended}
| Regex options: None |
| Regex flavors: .NET, Perl |
\p{InGreekExtended}
| Regex options: None |
| Regex flavors: Java, Perl |
Javascript, PCRE, Python, and Ruby do not support Unicode blocks.
\p{Greek}
| Regex options: None |
| Regex flavors: PCRE, Perl, Ruby 1.9 |
Unicode script support requires PCRE 6.5 or later, and PCRE
must be compiled with UTF-8 support. In PHP, turn on UTF-8 support
with the /u pattern modifier.
.NET, Javascript, and Python do not support Unicode scripts. Ruby
1.8 does not support Unicode regular expressions.
\X
| Regex options: None |
| Regex flavors: PCRE, Perl |
PCRE and Perl have a dedicated token for matching graphemes, but also support the workaround syntax using Unicode properties.
\P{M}\p{M}*
| Regex options: None |
| Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9 |
PCRE must be compiled with UTF-8 support; in PHP, turn on
UTF-8 support with the /u pattern
modifier. Javascript and Python do not support Unicode properties.
Ruby 1.8 does not support Unicode regular expressions.
A code point is one entry in the Unicode character database. A code point is not the same as a character, depending on the meaning you give to “character.” What appears as a character on screen is called a grapheme in Unicode.
The Unicode code point U+2122 represents the “trademark sign”
character. You can match this with \u2122 or \x{2122}, depending on the regex flavor you’re working
with.
The \u syntax requires exactly four hexadecimal digits. This means you can only use it for Unicode code points U+0000 through U+FFFF. The \x syntax allows any number of hexadecimal digits, supporting all code points U+000000 through U+10FFFF. You can match U+00E0 with \x{E0} or \x{00E0}. Code points above U+FFFF are used very infrequently, and are poorly supported by fonts and operating systems.
Code points can be used inside and outside character classes.
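As a minimal sketch of matching by code point, assuming Python 3.3 or later (where string literals are Unicode and the `re` module itself understands `\u` with four hex digits and `\U` with eight), the following matches the trademark sign on its own and inside a character class. Python's `re` does not accept `\x{...}`, so `\U0001D11E` stands in for it when a code point lies beyond U+FFFF. The sample strings are invented for illustration.

```python
import re

# U+2122 (trademark sign) matched by code point, outside a character class.
tm = re.compile(r"\u2122")

# The same code point inside a character class, next to U+00AE (registered sign).
marks = re.compile(r"[\u2122\u00AE]")

# \u takes exactly four hex digits; the eight-digit \U form reaches beyond U+FFFF.
clef = re.compile(r"\U0001D11E")  # U+1D11E MUSICAL SYMBOL G CLEF

text = "Acme\u2122 and Widget\u00AE"  # hypothetical sample text
tm_found = tm.findall(text)
marks_found = marks.findall(text)
clef_found = clef.search("score: \U0001D11E") is not None
```

Raw strings are used so that the regex engine, not the Python parser, interprets the escapes; either approach yields the same match here.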
Each Unicode code point has exactly one Unicode property, or fits into a single Unicode category. These terms mean the same thing. There are 30 Unicode categories, grouped into 7 super-categories:
\p{L}: Any kind of letter from any language
  \p{Ll}: A lowercase letter that has an uppercase variant
  \p{Lu}: An uppercase letter that has a lowercase variant
  \p{Lt}: A letter that appears at the start of a word when only the first letter of the word is capitalized
  \p{Lm}: A special character that is used like a letter
  \p{Lo}: A letter or ideograph that does not have lowercase and uppercase variants
\p{M}: A character intended to be combined with another character (accents, umlauts, enclosing boxes, etc.)
  \p{Mn}: A character intended to be combined with another character that does not take up extra space (e.g., accents, umlauts, etc.)
  \p{Mc}: A character intended to be combined with another character that does take up extra space (e.g., vowel signs in many Eastern languages)
  \p{Me}: A character that encloses another character (circle, square, keycap, etc.)
\p{Z}: Any kind of whitespace or invisible separator
  \p{Zs}: A whitespace character that is invisible, but does take up space
  \p{Zl}: The line separator character U+2028
  \p{Zp}: The paragraph separator character U+2029
\p{S}: Math symbols, currency signs, dingbats, box-drawing characters, etc.
  \p{Sm}: Any mathematical symbol
  \p{Sc}: Any currency sign
  \p{Sk}: A combining character (mark) as a full character on its own
  \p{So}: Various symbols that are not math symbols, currency signs, or combining characters
\p{N}: Any kind of numeric character in any script
  \p{Nd}: A digit 0 through 9 in any script except ideographic scripts
  \p{Nl}: A number that looks like a letter, such as a Roman numeral
  \p{No}: A superscript or subscript digit, or a number that is not a digit 0…9 (excluding numbers from ideographic scripts)
\p{P}: Any kind of punctuation character
  \p{Pd}: Any kind of hyphen or dash
  \p{Ps}: Any kind of opening bracket
  \p{Pe}: Any kind of closing bracket
  \p{Pi}: Any kind of opening quote
  \p{Pf}: Any kind of closing quote
  \p{Pc}: A punctuation character such as an underscore that connects words
  \p{Po}: Any kind of punctuation character that is not a dash, bracket, quote or connector
\p{C}: Invisible control characters and unused code points
  \p{Cc}: An ASCII 0x00…0x1F or Latin-1 0x80…0x9F control character
  \p{Cf}: An invisible formatting indicator
  \p{Co}: Any code point reserved for private use
  \p{Cs}: One half of a surrogate pair in UTF-16 encoding
  \p{Cn}: Any code point to which no character has been assigned
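Python's `re` module has no `\p{...}` token, but the standard `unicodedata` module exposes the same two-letter category codes. The sketch below looks up the category of a handful of sample characters; the selection of characters is ours, chosen only to illustrate a few entries from the list above.

```python
import unicodedata

# Hypothetical sample characters, keyed by a short description.
samples = {
    "lowercase a": "a",
    "uppercase A": "A",
    "titlecase digraph": "\u01C5",
    "combining grave accent": "\u0300",
    "euro sign": "\u20AC",
    "digit five": "5",
    "Roman numeral four": "\u2163",
    "superscript two": "\u00B2",
    "underscore": "_",
    "line separator": "\u2028",
}

# unicodedata.category() returns the two-letter code used inside \p{...}.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, code in categories.items():
    print(f"{label}: {code}")
```

The first letter of each code is the super-category, so a check such as `code.startswith("L")` plays the role of `\p{L}`.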
\p{Ll} matches
a single code point that has the Ll, or “lowercase letter,” property.
\p{L} is a quick way
of writing [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}] that matches
a single code point in any of the “letter” categories.
\P is the
negated version of \p. \P{Ll} matches a single code point that does
not have the Ll property.
\P{L} matches a single code point that does not have any of the “letter” properties. This is not the same as [\P{Ll}\P{Lu}\P{Lt}\P{Lm}\P{Lo}], which matches all code points. \P{Ll} matches the code points with the Lu property (and every other property except Ll), whereas \P{Lu} includes the Ll code points. Combining just these two in a character class already matches all possible code points.
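The difference between \P{L} and a class of negated subcategories can be checked by hand. This is a sketch only: the helpers `p` and `not_p` are hypothetical stand-ins for `\p{...}` and `\P{...}`, built on Python's `unicodedata.category` because Python's `re` does not support Unicode properties.

```python
import unicodedata

def p(ch, prop):
    # Stand-in for \p{prop}: "L" covers every letter category, "Ll" only lowercase.
    return unicodedata.category(ch).startswith(prop)

def not_p(ch, prop):
    # Stand-in for \P{prop}.
    return not p(ch, prop)

sample = "aA1 $\u0300"  # hypothetical sample: two letters, four non-letters

# \P{L}: code points that are not letters of any kind.
non_letters = [ch for ch in sample if not_p(ch, "L")]

# [\P{Ll}\P{Lu}]: "not lowercase" OR "not uppercase", which is true for everything.
negated_union = [ch for ch in sample if not_p(ch, "Ll") or not_p(ch, "Lu")]
```

Here `non_letters` keeps only the digit, space, dollar sign, and combining accent, while `negated_union` keeps every code point in the sample.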
The Unicode character database divides all the code points into blocks. Each block consists of a single range of code points. The code points U+0000 through U+FFFF are divided into 105 blocks:
| U+0000…U+007F | \p{InBasic_Latin} |
| U+0080…U+00FF | \p{InLatin-1_Supplement} |
| U+0100…U+017F | \p{InLatin_Extended-A} |
| U+0180…U+024F | \p{InLatin_Extended-B} |
| U+0250…U+02AF | \p{InIPA_Extensions} |
| U+02B0…U+02FF | \p{InSpacing_Modifier_Letters} |
| U+0300…U+036F | \p{InCombining_Diacritical_Marks} |
| U+0370…U+03FF | \p{InGreek_and_Coptic} |
| U+0400…U+04FF | \p{InCyrillic} |
| U+0500…U+052F | \p{InCyrillic_Supplementary} |
| U+0530…U+058F | \p{InArmenian} |
| U+0590…U+05FF | \p{InHebrew} |
| U+0600…U+06FF | \p{InArabic} |
| U+0700…U+074F | \p{InSyriac} |
| U+0780…U+07BF | \p{InThaana} |
| U+0900…U+097F | \p{InDevanagari} |
| U+0980…U+09FF | \p{InBengali} |
| U+0A00…U+0A7F | \p{InGurmukhi} |
| U+0A80…U+0AFF | \p{InGujarati} |
| U+0B00…U+0B7F | \p{InOriya} |
| U+0B80…U+0BFF | \p{InTamil} |
| U+0C00…U+0C7F | \p{InTelugu} |
| U+0C80…U+0CFF | \p{InKannada} |
| U+0D00…U+0D7F | \p{InMalayalam} |
| U+0D80…U+0DFF | \p{InSinhala} |
| U+0E00…U+0E7F | \p{InThai} |
| U+0E80…U+0EFF | \p{InLao} |
| U+0F00…U+0FFF | \p{InTibetan} |
| U+1000…U+109F | \p{InMyanmar} |
| U+10A0…U+10FF | \p{InGeorgian} |
| U+1100…U+11FF | \p{InHangul_Jamo} |
| U+1200…U+137F | \p{InEthiopic} |
| U+13A0…U+13FF | \p{InCherokee} |
| U+1400…U+167F | \p{InUnified_Canadian_Aboriginal_Syllabics} |
| U+1680…U+169F | \p{InOgham} |
| U+16A0…U+16FF | \p{InRunic} |
| U+1700…U+171F | \p{InTagalog} |
| U+1720…U+173F | \p{InHanunoo} |
| U+1740…U+175F | \p{InBuhid} |
| U+1760…U+177F | \p{InTagbanwa} |
| U+1780…U+17FF | \p{InKhmer} |
| U+1800…U+18AF | \p{InMongolian} |
| U+1900…U+194F | \p{InLimbu} |
| U+1950…U+197F | \p{InTai_Le} |
| U+19E0…U+19FF | \p{InKhmer_Symbols} |
| U+1D00…U+1D7F | \p{InPhonetic_Extensions} |
| U+1E00…U+1EFF | \p{InLatin_Extended_Additional} |
| U+1F00…U+1FFF | \p{InGreek_Extended} |
| U+2000…U+206F | \p{InGeneral_Punctuation} |
| U+2070…U+209F | \p{InSuperscripts_and_Subscripts} |
| U+20A0…U+20CF | \p{InCurrency_Symbols} |
| U+20D0…U+20FF | \p{InCombining_Diacritical_Marks_for_Symbols} |
| U+2100…U+214F | \p{InLetterlike_Symbols} |
| U+2150…U+218F | \p{InNumber_Forms} |
| U+2190…U+21FF | \p{InArrows} |
| U+2200…U+22FF | \p{InMathematical_Operators} |
| U+2300…U+23FF | \p{InMiscellaneous_Technical} |
| U+2400…U+243F | \p{InControl_Pictures} |
| U+2440…U+245F | \p{InOptical_Character_Recognition} |
| U+2460…U+24FF | \p{InEnclosed_Alphanumerics} |
| U+2500…U+257F | \p{InBox_Drawing} |
| U+2580…U+259F | \p{InBlock_Elements} |
| U+25A0…U+25FF | \p{InGeometric_Shapes} |
| U+2600…U+26FF | \p{InMiscellaneous_Symbols} |
| U+2700…U+27BF | \p{InDingbats} |
| U+27C0…U+27EF | \p{InMiscellaneous_Mathematical_Symbols-A} |
| U+27F0…U+27FF | \p{InSupplemental_Arrows-A} |
| U+2800…U+28FF | \p{InBraille_Patterns} |
| U+2900…U+297F | \p{InSupplemental_Arrows-B} |
| U+2980…U+29FF | \p{InMiscellaneous_Mathematical_Symbols-B} |
| U+2A00…U+2AFF | \p{InSupplemental_Mathematical_Operators} |
| U+2B00…U+2BFF | \p{InMiscellaneous_Symbols_and_Arrows} |
| U+2E80…U+2EFF | \p{InCJK_Radicals_Supplement} |
| U+2F00…U+2FDF | \p{InKangxi_Radicals} |
| U+2FF0…U+2FFF | \p{InIdeographic_Description_Characters} |
| U+3000…U+303F | \p{InCJK_Symbols_and_Punctuation} |
| U+3040…U+309F | \p{InHiragana} |
| U+30A0…U+30FF | \p{InKatakana} |
| U+3100…U+312F | \p{InBopomofo} |
| U+3130…U+318F | \p{InHangul_Compatibility_Jamo} |
| U+3190…U+319F | \p{InKanbun} |
| U+31A0…U+31BF | \p{InBopomofo_Extended} |
| U+31F0…U+31FF | \p{InKatakana_Phonetic_Extensions} |
| U+3200…U+32FF | \p{InEnclosed_CJK_Letters_and_Months} |
| U+3300…U+33FF | \p{InCJK_Compatibility} |
| U+3400…U+4DBF | \p{InCJK_Unified_Ideographs_Extension_A} |
| U+4DC0…U+4DFF | \p{InYijing_Hexagram_Symbols} |
| U+4E00…U+9FFF | \p{InCJK_Unified_Ideographs} |
| U+A000…U+A48F | \p{InYi_Syllables} |
| U+A490…U+A4CF | \p{InYi_Radicals} |
| U+AC00…U+D7AF | \p{InHangul_Syllables} |
| U+D800…U+DB7F | \p{InHigh_Surrogates} |
| U+DB80…U+DBFF | \p{InHigh_Private_Use_Surrogates} |
| U+DC00…U+DFFF | \p{InLow_Surrogates} |
| U+E000…U+F8FF | \p{InPrivate_Use_Area} |
| U+F900…U+FAFF | \p{InCJK_Compatibility_Ideographs} |
| U+FB00…U+FB4F | \p{InAlphabetic_Presentation_Forms} |
| U+FB50…U+FDFF | \p{InArabic_Presentation_Forms-A} |
| U+FE00…U+FE0F | \p{InVariation_Selectors} |
| U+FE20…U+FE2F | \p{InCombining_Half_Marks} |
| U+FE30…U+FE4F | \p{InCJK_Compatibility_Forms} |
| U+FE50…U+FE6F | \p{InSmall_Form_Variants} |
| U+FE70…U+FEFF | \p{InArabic_Presentation_Forms-B} |
| U+FF00…U+FFEF | \p{InHalfwidth_and_Fullwidth_Forms} |
| U+FFF0…U+FFFF | \p{InSpecials} |
A Unicode block is a single, contiguous range of code points. Although many blocks have the names of Unicode scripts and Unicode categories, they do not correspond 100% with them. The name of a block only indicates its primary use.
The Currency_Symbols block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks, for historical reasons. Both do have the Currency Symbol property. To match any currency symbol, use \p{Sc} instead of \p{InCurrency_Symbols}.
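The mismatch between the block and the property is easy to observe. The following sketch, with helper names of our own choosing, tests three currency signs against the U+20A0…U+20CF range from the table and against the Sc category reported by Python's `unicodedata` module.

```python
import unicodedata

# The Currency_Symbols block as listed in the table: U+20A0 through U+20CF.
CURRENCY_BLOCK = range(0x20A0, 0x20D0)

def in_currency_block(ch):
    # Block membership is a plain range test on the code point.
    return ord(ch) in CURRENCY_BLOCK

def is_currency_symbol(ch):
    # Property membership: general category Sc.
    return unicodedata.category(ch) == "Sc"

# Dollar (U+0024), yen (U+00A5), and euro (U+20AC).
report = {
    ch: (in_currency_block(ch), is_currency_symbol(ch))
    for ch in "$\u00A5\u20AC"
}
```

Only the euro sign is inside the block, yet all three carry the Currency Symbol property.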
Most blocks include unassigned code points, which are covered by the property \p{Cn}. No other Unicode
property, and none of the Unicode scripts, include unassigned code
points.
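To see unassigned code points inside a block, one can scan the block's range and keep the code points whose category is Cn. This is a rough sketch using Python's `unicodedata`; the exact result depends on the Unicode version bundled with the Python build, although the Greek Extended block has been stable for a long time.

```python
import unicodedata

# Greek Extended block: U+1F00 through U+1FFF.
GREEK_EXTENDED = range(0x1F00, 0x2000)

# Code points in the block to which no character has been assigned (category Cn).
unassigned = [
    cp for cp in GREEK_EXTENDED
    if unicodedata.category(chr(cp)) == "Cn"
]

print([f"U+{cp:04X}" for cp in unassigned])
```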
The \p{InBlockName} syntax works with Java and Perl. .NET uses the \p{IsBlockName} syntax.
Perl also supports the Is
variant, but we recommend you stick with the In syntax, to avoid confusion with Unicode
scripts. For scripts, Perl supports \p{Script} and \p{IsScript}, but not \p{InScript}.
Each Unicode code point, except unassigned ones, is part of exactly one Unicode script. Unassigned code points are not part of any script. The assigned code points up to U+FFFF are assigned to these scripts:
\p{Common} | \p{Katakana} |
\p{Arabic} | \p{Khmer} |
\p{Armenian} | \p{Lao} |
\p{Bengali} | \p{Latin} |
\p{Bopomofo} | \p{Limbu} |
\p{Braille} | \p{Malayalam} |
\p{Buhid} | \p{Mongolian} |
\p{CanadianAboriginal} | \p{Myanmar} |
\p{Cherokee} | \p{Ogham} |
\p{Cyrillic} | \p{Oriya} |
\p{Devanagari} | \p{Runic} |
\p{Ethiopic} | \p{Sinhala} |
\p{Georgian} | \p{Syriac} |
\p{Greek} | \p{Tagalog} |
\p{Gujarati} | \p{Tagbanwa} |
\p{Gurmukhi} | \p{TaiLe} |
\p{Han} | \p{Tamil} |
\p{Hangul} | \p{Telugu} |
\p{Hanunoo} | \p{Thaana} |
\p{Hebrew} | \p{Thai} |
\p{Hiragana} | \p{Tibetan} |
\p{Inherited} | \p{Yi} |
\p{Kannada} |
A script is a group of code points used by a particular human
writing system. Some scripts, such as Thai, correspond with a single human
language. Other scripts, such as Latin, span multiple languages. Some
languages are composed of multiple scripts. For instance, there is
no Japanese Unicode script; instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are
usually composed of.
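Python's standard library does not expose the Unicode script property, so the sketch below is only a loose approximation: it guesses a script from the prefix of each character's official name. That heuristic is our assumption, not how a regex engine resolves \p{Hiragana} and friends, and it misclassifies some characters that belong to the Common script. It is enough to show that a short Japanese sentence draws on several scripts.

```python
import unicodedata

# Assumption: name prefixes are a rough proxy for the script property.
PREFIXES = {
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "CJK UNIFIED IDEOGRAPH": "Han",
    "LATIN": "Latin",
}

def guess_script(ch):
    # Returns a script name for the four prefixes above, "other" otherwise.
    name = unicodedata.name(ch, "")
    for prefix, script in PREFIXES.items():
        if name.startswith(prefix):
            return script
    return "other"

# Hypothetical sample mixing hiragana, katakana, Latin letters, and kanji.
text = "\u3053\u308C\u306F\u30C6\u30B9\u30C8\u3067\u3059 test \u65E5\u672C"
scripts_seen = {guess_script(ch) for ch in text}
```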
We listed the Common script
first, out of alphabetical order. This script contains all sorts of
characters that are common to a wide range of scripts, such as
punctuation, whitespace, and miscellaneous symbols.
The difference between code points and characters comes into play when there are combining marks. The Unicode code point U+0061 is “Latin small letter a,” whereas U+00E0 is “Latin small letter a with grave accent.” Both represent what most people would describe as a character.
U+0300 is the “combining grave accent” combining mark. It can
be used sensibly only after a letter. A string consisting of the
Unicode code points U+0061 U+0300 will be displayed as à, just like U+00E0. The combining mark
U+0300 is displayed on top of the character U+0061.
The reason for these two different ways of displaying an accented letter is that many historical character sets encode “a with grave accent” as a single character. Unicode’s designers thought it would be useful to have a one-on-one mapping with popular legacy character sets, in addition to the Unicode way of separating marks and base letters, which makes arbitrary combinations not supported by legacy character sets possible.
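The two encodings of à described above can be inspected directly. A small sketch, using Python's `unicodedata.name` to print the official names of the code points involved:

```python
import unicodedata

precomposed = "\u00E0"        # a with grave accent as one code point
decomposed = "\u0061\u0300"   # base letter followed by a combining mark

# Official names of the three code points involved.
names = [unicodedata.name(ch) for ch in precomposed + decomposed]

# The strings look identical on screen but differ in length and content.
lengths = (len(precomposed), len(decomposed))
equal = precomposed == decomposed
```

Both strings render as the same grapheme, but a code-point comparison treats them as different text.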
What matters to you as a regex user is that all regex flavors
discussed in this book operate on code points rather than graphical
characters. When we say that the regular expression . matches a single character,
it really matches just a single code point. If your subject text
consists of the two code points U+0061 U+0300, which can be
represented as the string literal "\u0061\u0300" in a programming language such
as Java, the dot will match only the code point U+0061, or a, without the accent U+0300. The regex
.. will match
both.
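The same behavior can be reproduced in Python, whose `re` module also works on code points. A minimal sketch with the subject string from the paragraph above:

```python
import re

subject = "\u0061\u0300"  # "a" followed by a combining grave accent

# A single dot consumes only the base letter.
one_dot = re.match(r".", subject).group()

# Two dots consume the base letter and the combining mark.
two_dots = re.match(r"..", subject).group()
```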
Perl and PCRE offer a special regex token \X, which matches any single Unicode grapheme.
Essentially, it is the Unicode version of the venerable dot. It
matches any Unicode code point that is not a combining mark, along
with all the combining marks that follow it, if any. \P{M}\p{M}* does the same
thing using the Unicode property syntax. \X will find two matches in the text
àà,
regardless of how it is encoded. If it is encoded as "\u00E0\u0061\u0300" the first
match is "\u00E0",
and the second "\u0061\u0300".
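Python's `re` module offers neither \X nor \p{M}, so the following sketch imitates \P{M}\p{M}* by hand: it walks the string and attaches every mark (any category starting with M) to the preceding code point. The function name `graphemes` is ours. Unlike the regex, it starts a new cluster for a mark that has no base character before it, and it makes no attempt to follow the fuller grapheme cluster rules that newer engines apply to \X.

```python
import unicodedata

def is_mark(ch):
    # Equivalent of \p{M}: categories Mn, Mc, and Me.
    return unicodedata.category(ch).startswith("M")

def graphemes(text):
    # Split text into a base code point plus the marks that follow it.
    clusters = []
    for ch in text:
        if is_mark(ch) and clusters:
            clusters[-1] += ch   # attach the mark to the current cluster
        else:
            clusters.append(ch)  # a non-mark (or a leading mark) starts a cluster
    return clusters

# Two graphemes, the first precomposed and the second decomposed.
result = graphemes("\u00E0\u0061\u0300")
```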
The uppercase \P is the negated variant of the lowercase \p. For instance, \P{Sc} matches any character that does not have the “Currency Symbol” Unicode property. \P is supported by all flavors that support \p, and for all the properties, blocks, and scripts that they support.
All flavors allow all the \u,
\x, \p, and \P tokens they support to be
used inside character classes. The character represented by the code
point, or the characters in the category, block, or script, are then
added to the character class. For instance, you could match a
character that is either an opening quote (initial punctuation
property), a closing quote (final punctuation property), or the
trademark symbol (U+2122) with:
[\p{Pi}\p{Pf}\x{2122}]
| Regex options: None |
| Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9 |
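For a flavor without Unicode property support, such as Python's `re`, the same test can be approximated outside the regex. The helper `matches_class` below is a hypothetical stand-in for the character class above, built on `unicodedata.category`; the sample sentence is invented.

```python
import unicodedata

def matches_class(ch):
    # Imitates [\p{Pi}\p{Pf}\x{2122}]: opening quote, closing quote, or trademark sign.
    return unicodedata.category(ch) in ("Pi", "Pf") or ch == "\u2122"

# Curly quotes around a name with a trademark sign, then straight ASCII quotes.
text = "\u201CAcme\u2122\u201D is a \"brand\""
hits = [ch for ch in text if matches_class(ch)]
```

The straight ASCII quotation mark is not matched, because its category is Po rather than Pi or Pf.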
If your regular expression flavor does not support Unicode categories, blocks, or scripts, you can list the characters that are in the category, block, or script in a character class. For blocks this is very easy: each block is simply a range between two code points. The Greek Extended block comprises the characters U+1F00 to U+1FFF:
[\u1F00-\u1FFF]
| Regex options: None |
| Regex flavors: .NET, Java, Javascript, Python |
[\x{1F00}-\x{1FFF}]
| Regex options: None |
| Regex flavors: PCRE, Perl, Ruby 1.9 |
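A short sketch of the block-as-range workaround in Python 3, where the `re` module interprets the `\u` escapes itself. The sample word is our own: its first letter is in the Greek Extended block, and the remaining letters are in the Greek and Coptic block.

```python
import re

# Greek Extended block written as a plain code point range.
greek_extended = re.compile(r"[\u1F00-\u1FFF]")

# U+1F08 is in Greek Extended; U+03C1, U+03C7, and U+03AE are in Greek and Coptic.
text = "\u1F08\u03C1\u03C7\u03AE"
found = greek_extended.findall(text)
```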
For most categories and many scripts, the equivalent character class is a long list of individual code points and short ranges. The characters that comprise each category and many of the scripts are scattered throughout the Unicode table. This is the Greek script:
[\u0370-\u0373\u0375\u0376-\u0377\u037A\u037B-\u037D\u0384\u0386↵
\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03F5\u03F6↵
\u03F7-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15↵
\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D↵
\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBD\u1FBE\u1FBF-\u1FC1↵
\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FCD-\u1FCF\u1FD0-\u1FD3\u1FD6-\u1FDB↵
\u1FDD-\u1FDF\u1FE0-\u1FEC\u1FED-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFC↵
\u1FFD-\u1FFE\u2126]
| Regex options: None |
| Regex flavors: .NET, Java, Javascript, Python |
We built this regular expression by copying the listing for the Greek script from http://www.unicode.o...ATA/Scripts.txt, searching and replacing with three regular expressions:
Searching for the regular expression ;.* and replacing its matches with nothing deletes the comments. If it deletes everything, undo and turn off “dot matches line breaks.”
Searching for ^ with “^ and $ match at line breaks” turned on, and replacing with \u, prefixes the code points with \u. Replacing \.\. with -\u corrects the ranges.
Finally, replacing \s+ with nothing removes the line breaks. Adding the brackets around the character class finishes the regex. You may have to add \u at the start of the character class and/or remove it at the end, depending on whether you included any leading or trailing blank lines when copying the listing from Scripts.txt.
This may seem like a lot of work, but it actually took Jan
less than a minute. Writing the description took much longer. Doing
this for the \x{} syntax is just
as easy:
Searching for the regular expression ;.* and replacing its matches with nothing deletes the comments. If it deletes everything, undo and turn off “dot matches line breaks.”
Searching for ^ with “^ and $ match at line breaks” turned on and replacing with \x{ prefixes the code points with \x{. Replacing \.\. with }-\x{ corrects the ranges.
Finally, replacing \s+ with } adds the closing braces and removes the line breaks. Adding the brackets around the character class finishes the regex. You may have to add \x{ at the start of the character class and/or remove it at the end, depending on whether you included any leading or trailing blank lines when copying the listing from Scripts.txt.
The results are:
[\x{0370}-\x{0373}\x{0375}\x{0376}-\x{0377}\x{037A}\x{037B}-\x{037D}↵
\x{0384}\x{0386}\x{0388}-\x{038A}\x{038C}\x{038E}-\x{03A1}↵
\x{03A3}-\x{03E1}\x{03F0}-\x{03F5}\x{03F6}\x{03F7}-\x{03FF}↵
\x{1D26}-\x{1D2A}\x{1D5D}-\x{1D61}\x{1D66}-\x{1D6A}\x{1DBF}↵
\x{1F00}-\x{1F15}\x{1F18}-\x{1F1D}\x{1F20}-\x{1F45}\x{1F48}-\x{1F4D}↵
\x{1F50}-\x{1F57}\x{1F59}\x{1F5B}\x{1F5D}\x{1F5F}-\x{1F7D}↵
\x{1F80}-\x{1FB4}\x{1FB6}-\x{1FBC}\x{1FBD}\x{1FBE}\x{1FBF}-\x{1FC1}↵
\x{1FC2}-\x{1FC4}\x{1FC6}-\x{1FCC}\x{1FCD}-\x{1FCF}\x{1FD0}-\x{1FD3}↵
\x{1FD6}-\x{1FDB}\x{1FDD}-\x{1FDF}\x{1FE0}-\x{1FEC}\x{1FED}-\x{1FEF}↵
\x{1FF2}-\x{1FF4}\x{1FF6}-\x{1FFC}\x{1FFD}-\x{1FFE}\x{2126}↵
\x{10140}-\x{10174}\x{10175}-\x{10178}\x{10179}-\x{10189}↵
\x{1018A}\x{1D200}-\x{1D241}\x{1D242}-\x{1D244}\x{1D245}]
| Regex options: None |
| Regex flavors: PCRE, Perl, Ruby 1.9 |
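The search-and-replace steps described above can also be scripted. The sketch below applies them with Python's `re.sub` to a three-line excerpt laid out like Scripts.txt; the excerpt and the function name `build_class` are ours, and the real listing is much longer. The listing is stripped first, so no stray prefix needs to be added or removed afterward. Replacement strings are passed through lambdas so that the backslashes are inserted literally.

```python
import re

# Hypothetical excerpt in the layout of Scripts.txt (code points, then a comment).
listing = """\
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
0375          ; Greek # Sk       GREEK LOWER NUMERAL SIGN
0376..0377    ; Greek # L&   [2] GREEK CAPITAL LETTER PAMPHYLIAN DIGAMMA..GREEK SMALL LETTER PAMPHYLIAN DIGAMMA
"""

def build_class(text, prefix, range_sep, line_end):
    text = text.strip()                                   # drop leading/trailing blank lines
    text = re.sub(r";.*", "", text)                       # delete the comments
    text = re.sub(r"^", lambda m: prefix, text, flags=re.MULTILINE)  # prefix each line
    text = re.sub(r"\.\.", lambda m: range_sep, text)     # correct the ranges
    text = re.sub(r"\s+", lambda m: line_end, text)       # remove spaces and line breaks
    return "[" + text + "]"                               # add the brackets

u_class = build_class(listing, "\\u", "-\\u", "")         # \uFFFF syntax
x_class = build_class(listing, "\\x{", "}-\\x{", "}")     # \x{FFFF} syntax

print(u_class)
print(x_class)
```

The `\u` result can be compiled by Python's own `re` module; the `\x{...}` result is meant for PCRE, Perl, or Ruby 1.9 and is only produced as a string here.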
Learn more about this topic from Regular Expressions Cookbook.