How to use Unicode code points, properties, blocks, and scripts in regular expressions

Jan Goyvaerts
Posted Sep 16 2009 01:05 PM

The following are common uses for Unicode code points, properties, blocks and scripts:

Use a regular expression to find the trademark sign (™) by specifying its Unicode code point rather than copying and pasting an actual trademark sign. If you like copy and paste, the trademark sign is just another literal character, even though you cannot type it directly on your keyboard.

Create a regular expression that matches any character that has the “Currency Symbol” Unicode property. Unicode properties are also called Unicode categories.

Create a regular expression that matches any character in the “Greek Extended” Unicode block.

Create a regular expression that matches any character that, according to the Unicode standard, is part of the Greek script.

Create a regular expression that matches a grapheme, or what is commonly thought of as a character: a base character with all its combining marks.


Here are sample solutions for the various flavors:


Unicode code point


\u2122
Regex options: None
Regex flavors: .NET, Java, JavaScript, Python

In Python 2, this regex works only when quoted as a Unicode string: u"\u2122". Python 3 strings are always Unicode.


\x{2122}
Regex options: None
Regex flavors: PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support; in PHP, turn on UTF-8 support with the /u pattern modifier. Ruby 1.8 does not support Unicode regular expressions.


Unicode property or category


\p{Sc}
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support; in PHP, turn on UTF-8 support with the /u pattern modifier. JavaScript and Python do not support Unicode properties. Ruby 1.8 does not support Unicode regular expressions.


Unicode block


\p{IsGreekExtended}
Regex options: None
Regex flavors: .NET, Perl

\p{InGreekExtended}
Regex options: None
Regex flavors: Java, Perl

JavaScript, PCRE, Python, and Ruby do not support Unicode blocks.


Unicode script


\p{Greek}
Regex options: None
Regex flavors: PCRE, Perl, Ruby 1.9

Unicode script support requires PCRE 6.5 or later, and PCRE must be compiled with UTF-8 support. In PHP, turn on UTF-8 support with the /u pattern modifier. .NET, JavaScript, and Python do not support Unicode scripts. Ruby 1.8 does not support Unicode regular expressions.


Unicode grapheme


\X
Regex options: None
Regex flavors: PCRE, Perl

PCRE and Perl have a dedicated token for matching graphemes, but also support the workaround syntax using Unicode properties.

\P{M}\p{M}*
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support; in PHP, turn on UTF-8 support with the /u pattern modifier. JavaScript and Python do not support Unicode properties. Ruby 1.8 does not support Unicode regular expressions.


Unicode code point

A code point is one entry in the Unicode character database. A code point is not the same as a character, depending on the meaning you give to “character.” What appears as a character on screen is called a grapheme in Unicode.

The Unicode code point U+2122 represents the “trademark sign” character. You can match this with \u2122 or \x{2122}, depending on the regex flavor you’re working with.

The \u syntax requires exactly four hexadecimal digits. This means you can only use it for Unicode code points U+0000 through U+FFFF. The \x syntax allows any number of hexadecimal digits, supporting all code points U+000000 through U+10FFFF. You can match U+00E0 with \x{E0} or \x{00E0}. Code points U+100000 and above are used very infrequently, and are poorly supported by fonts and operating systems.

Code points can be used inside and outside character classes.
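
As a minimal sketch of the points above, assuming Python 3.3 or later (where the re module itself parses \u escapes in a pattern), the following matches the trademark sign outside and inside a character class. The sample text and variable names are made up for illustration. Python's re does not offer the \x{...} syntax; for code points beyond U+FFFF it uses \U followed by exactly eight hexadecimal digits instead.

```python
import re

# Sample text containing U+2122 (trademark sign) and U+00AE (registered sign)
text = "Acme\u2122 and Widget\u00AE"

# \u takes exactly four hex digits; the raw string leaves the escape
# for the regex engine to interpret
trademark = re.compile(r"\u2122")
position = trademark.search(text).start()

# The same escapes work inside a character class
marks = re.compile(r"[\u2122\u00AE]")
found = marks.findall(text)

# Beyond U+FFFF, Python's re uses \U with exactly eight hex digits
beyond_bmp = re.compile(r"\U00010140")
matches_beyond_bmp = beyond_bmp.search("\U00010140") is not None
```

Here position is the index of the trademark sign in the sample text, and found lists both signs in the order they occur.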


Unicode property or category

Each Unicode code point has exactly one Unicode property, or fits into a single Unicode category. These terms mean the same thing. There are 30 Unicode categories, grouped into 7 super-categories:

\p{L}

Any kind of letter from any language

\p{Ll}

A lowercase letter that has an uppercase variant

\p{Lu}

An uppercase letter that has a lowercase variant

\p{Lt}

A letter that appears at the start of a word when only the first letter of the word is capitalized

\p{Lm}

A special character that is used like a letter

\p{Lo}

A letter or ideograph that does not have lowercase and uppercase variants

\p{M}

A character intended to be combined with another character (accents, umlauts, enclosing boxes, etc.)

\p{Mn}

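A character intended to be combined with another character that does not take up extra space (accents, umlauts, etc.)

\p{Mc}

A character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)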

\p{Me}

A character that encloses another character (circle, square, keycap, etc.)

\p{Z}

Any kind of whitespace or invisible separator

\p{Zs}

A whitespace character that is invisible, but does take up space

\p{Zl}

The line separator character U+2028

\p{Zp}

The paragraph separator character U+2029

\p{S}

Math symbols, currency signs, dingbats, box-drawing characters, etc.

\p{Sm}

Any mathematical symbol

\p{Sc}

Any currency sign

\p{Sk}

A combining character (mark) as a full character on its own

\p{So}

Various symbols that are not math symbols, currency signs, or combining characters

\p{N}

Any kind of numeric character in any script

\p{Nd}

A digit 0 through 9 in any script except ideographic scripts

\p{Nl}

A number that looks like a letter, such as a Roman numeral

\p{No}

A superscript or subscript digit, or a number that is not a digit 0…9 (excluding numbers from ideographic scripts)

\p{P}

Any kind of punctuation character

\p{Pd}

Any kind of hyphen or dash

\p{Ps}

Any kind of opening bracket

\p{Pe}

Any kind of closing bracket

\p{Pi}

Any kind of opening quote

\p{Pf}

Any kind of closing quote

\p{Pc}

A punctuation character such as an underscore that connects words

\p{Po}

Any kind of punctuation character that is not a dash, bracket, quote or connector

\p{C}

Invisible control characters and unused code points

\p{Cc}

An ASCII 0x00…0x1F or Latin-1 0x80…0x9F control character

\p{Cf}

An invisible formatting indicator

\p{Co}

Any code point reserved for private use

\p{Cs}

One half of a surrogate pair in UTF-16 encoding

\p{Cn}

Any code point to which no character has been assigned

\p{Ll} matches a single code point that has the Ll, or “lowercase letter,” property. \p{L} is a quick way of writing [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}] that matches a single code point in any of the “letter” categories.

\P is the negated version of \p. \P{Ll} matches a single code point that does not have the Ll property. \P{L} matches a single code point that does not have any of the “letter” properties. This is not the same as [\P{Ll}\P{Lu}\P{Lt}\P{Lm}\P{Lo}], which matches all code points. \P{Ll} matches the code points with the Lu property (and every other property except Ll), whereas \P{Lu} includes the Ll code points. Combining just these two in a code point class already matches all possible code points.
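
For a flavor without \p, such as Python's re module, the category logic can still be explored outside the regex. The sketch below assumes Python 3 and uses the standard unicodedata module; the helper in_category is a hypothetical name introduced here, not part of any library. It mimics \p{Sc}, \P{L}, and the observation that "not Ll or not Lu" holds for every code point.

```python
import unicodedata

def in_category(ch, name):
    """True if ch's general category equals name ("Sc") or starts with it ("L")."""
    return unicodedata.category(ch).startswith(name)

text = "Price: $5, \u20AC7 or \u00A59"

# Rough equivalent of finding every \p{Sc} match in the text
currency = [ch for ch in text if in_category(ch, "Sc")]

# Rough equivalent of \P{L}: code points in none of the letter categories
non_letters = [ch for ch in "aB1" if not in_category(ch, "L")]

# A class combining \P{Ll} and \P{Lu} excludes nothing: no code point
# is both Ll and Lu, so at least one of the two negations always holds
covers_everything = all(
    (not in_category(ch, "Ll")) or (not in_category(ch, "Lu"))
    for ch in "aA1$ \u20AC"
)
```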


Unicode block

The Unicode character database divides all the code points into blocks. Each block consists of a single range of code points. The code points U+0000 through U+FFFF are divided into 105 blocks:

U+0000…U+007F  \p{InBasic_Latin}
U+0080…U+00FF  \p{InLatin-1_Supplement}
U+0100…U+017F  \p{InLatin_Extended-A}
U+0180…U+024F  \p{InLatin_Extended-B}
U+0250…U+02AF  \p{InIPA_Extensions}
U+02B0…U+02FF  \p{InSpacing_Modifier_Letters}
U+0300…U+036F  \p{InCombining_Diacritical_Marks}
U+0370…U+03FF  \p{InGreek_and_Coptic}
U+0400…U+04FF  \p{InCyrillic}
U+0500…U+052F  \p{InCyrillic_Supplementary}
U+0530…U+058F  \p{InArmenian}
U+0590…U+05FF  \p{InHebrew}
U+0600…U+06FF  \p{InArabic}
U+0700…U+074F  \p{InSyriac}
U+0780…U+07BF  \p{InThaana}
U+0900…U+097F  \p{InDevanagari}
U+0980…U+09FF  \p{InBengali}
U+0A00…U+0A7F  \p{InGurmukhi}
U+0A80…U+0AFF  \p{InGujarati}
U+0B00…U+0B7F  \p{InOriya}
U+0B80…U+0BFF  \p{InTamil}
U+0C00…U+0C7F  \p{InTelugu}
U+0C80…U+0CFF  \p{InKannada}
U+0D00…U+0D7F  \p{InMalayalam}
U+0D80…U+0DFF  \p{InSinhala}
U+0E00…U+0E7F  \p{InThai}
U+0E80…U+0EFF  \p{InLao}
U+0F00…U+0FFF  \p{InTibetan}
U+1000…U+109F  \p{InMyanmar}
U+10A0…U+10FF  \p{InGeorgian}
U+1100…U+11FF  \p{InHangul_Jamo}
U+1200…U+137F  \p{InEthiopic}
U+13A0…U+13FF  \p{InCherokee}
U+1400…U+167F  \p{InUnified_Canadian_Aboriginal_Syllabics}
U+1680…U+169F  \p{InOgham}
U+16A0…U+16FF  \p{InRunic}
U+1700…U+171F  \p{InTagalog}
U+1720…U+173F  \p{InHanunoo}
U+1740…U+175F  \p{InBuhid}
U+1760…U+177F  \p{InTagbanwa}
U+1780…U+17FF  \p{InKhmer}
U+1800…U+18AF  \p{InMongolian}
U+1900…U+194F  \p{InLimbu}
U+1950…U+197F  \p{InTai_Le}
U+19E0…U+19FF  \p{InKhmer_Symbols}
U+1D00…U+1D7F  \p{InPhonetic_Extensions}
U+1E00…U+1EFF  \p{InLatin_Extended_Additional}
U+1F00…U+1FFF  \p{InGreek_Extended}
U+2000…U+206F  \p{InGeneral_Punctuation}
U+2070…U+209F  \p{InSuperscripts_and_Subscripts}
U+20A0…U+20CF  \p{InCurrency_Symbols}
U+20D0…U+20FF  \p{InCombining_Diacritical_Marks_for_Symbols}
U+2100…U+214F  \p{InLetterlike_Symbols}
U+2150…U+218F  \p{InNumber_Forms}
U+2190…U+21FF  \p{InArrows}
U+2200…U+22FF  \p{InMathematical_Operators}
U+2300…U+23FF  \p{InMiscellaneous_Technical}
U+2400…U+243F  \p{InControl_Pictures}
U+2440…U+245F  \p{InOptical_Character_Recognition}
U+2460…U+24FF  \p{InEnclosed_Alphanumerics}
U+2500…U+257F  \p{InBox_Drawing}
U+2580…U+259F  \p{InBlock_Elements}
U+25A0…U+25FF  \p{InGeometric_Shapes}
U+2600…U+26FF  \p{InMiscellaneous_Symbols}
U+2700…U+27BF  \p{InDingbats}
U+27C0…U+27EF  \p{InMiscellaneous_Mathematical_Symbols-A}
U+27F0…U+27FF  \p{InSupplemental_Arrows-A}
U+2800…U+28FF  \p{InBraille_Patterns}
U+2900…U+297F  \p{InSupplemental_Arrows-B}
U+2980…U+29FF  \p{InMiscellaneous_Mathematical_Symbols-B}
U+2A00…U+2AFF  \p{InSupplemental_Mathematical_Operators}
U+2B00…U+2BFF  \p{InMiscellaneous_Symbols_and_Arrows}
U+2E80…U+2EFF  \p{InCJK_Radicals_Supplement}
U+2F00…U+2FDF  \p{InKangxi_Radicals}
U+2FF0…U+2FFF  \p{InIdeographic_Description_Characters}
U+3000…U+303F  \p{InCJK_Symbols_and_Punctuation}
U+3040…U+309F  \p{InHiragana}
U+30A0…U+30FF  \p{InKatakana}
U+3100…U+312F  \p{InBopomofo}
U+3130…U+318F  \p{InHangul_Compatibility_Jamo}
U+3190…U+319F  \p{InKanbun}
U+31A0…U+31BF  \p{InBopomofo_Extended}
U+31F0…U+31FF  \p{InKatakana_Phonetic_Extensions}
U+3200…U+32FF  \p{InEnclosed_CJK_Letters_and_Months}
U+3300…U+33FF  \p{InCJK_Compatibility}
U+3400…U+4DBF  \p{InCJK_Unified_Ideographs_Extension_A}
U+4DC0…U+4DFF  \p{InYijing_Hexagram_Symbols}
U+4E00…U+9FFF  \p{InCJK_Unified_Ideographs}
U+A000…U+A48F  \p{InYi_Syllables}
U+A490…U+A4CF  \p{InYi_Radicals}
U+AC00…U+D7AF  \p{InHangul_Syllables}
U+D800…U+DB7F  \p{InHigh_Surrogates}
U+DB80…U+DBFF  \p{InHigh_Private_Use_Surrogates}
U+DC00…U+DFFF  \p{InLow_Surrogates}
U+E000…U+F8FF  \p{InPrivate_Use_Area}
U+F900…U+FAFF  \p{InCJK_Compatibility_Ideographs}
U+FB00…U+FB4F  \p{InAlphabetic_Presentation_Forms}
U+FB50…U+FDFF  \p{InArabic_Presentation_Forms-A}
U+FE00…U+FE0F  \p{InVariation_Selectors}
U+FE20…U+FE2F  \p{InCombining_Half_Marks}
U+FE30…U+FE4F  \p{InCJK_Compatibility_Forms}
U+FE50…U+FE6F  \p{InSmall_Form_Variants}
U+FE70…U+FEFF  \p{InArabic_Presentation_Forms-B}
U+FF00…U+FFEF  \p{InHalfwidth_and_Fullwidth_Forms}
U+FFF0…U+FFFF  \p{InSpecials}

A Unicode block is a single, contiguous range of code points. Although many blocks have the names of Unicode scripts and Unicode categories, they do not correspond 100% with them. The name of a block only indicates its primary use.

The Currency_Symbols block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks, for historical reasons. Both do have the Currency Symbol property. To match any currency symbol, use \p{Sc} instead of \p{InCurrency_Symbols}.

Most blocks include unassigned code points, which are covered by the property \p{Cn}. No other Unicode property, and none of the Unicode scripts, include unassigned code points.

The \p{InBlockName} syntax works with Java and Perl. .NET uses the \p{IsBlockName} syntax.

Perl also supports the Is variant, but we recommend you stick with the In syntax, to avoid confusion with Unicode scripts. For scripts, Perl supports \p{Script} and \p{IsScript}, but not \p{InScript}.
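
The mismatch between the currency block and the currency property can be checked with a short sketch. It assumes Python 3, writes the block as the plain range U+20A0 through U+20CF taken from the table above (Python's re has no block syntax), and uses unicodedata for the category. The variable names are illustrative only.

```python
import re
import unicodedata

# Currency_Symbols block as listed above: U+20A0 through U+20CF
currency_block = re.compile(r"[\u20A0-\u20CF]")

rows = []
for ch in "$\u00A5\u20AC":  # dollar, yen, euro
    in_block = currency_block.fullmatch(ch) is not None
    is_currency = unicodedata.category(ch) == "Sc"
    rows.append((f"U+{ord(ch):04X}", in_block, is_currency))

for code_point, in_block, is_currency in rows:
    print(code_point, "in block:", in_block, "category Sc:", is_currency)
```

The dollar and yen signs report the Sc category while falling outside the block range; the euro sign satisfies both tests.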


Unicode script

Each Unicode code point, except unassigned ones, is part of exactly one Unicode script. Unassigned code points are not part of any script. The assigned code points up to U+FFFF are assigned to these scripts:

\p{Common}              \p{Katakana}
\p{Arabic}              \p{Khmer}
\p{Armenian}            \p{Lao}
\p{Bengali}             \p{Latin}
\p{Bopomofo}            \p{Limbu}
\p{Braille}             \p{Malayalam}
\p{Buhid}               \p{Mongolian}
\p{CanadianAboriginal}  \p{Myanmar}
\p{Cherokee}            \p{Ogham}
\p{Cyrillic}            \p{Oriya}
\p{Devanagari}          \p{Runic}
\p{Ethiopic}            \p{Sinhala}
\p{Georgian}            \p{Syriac}
\p{Greek}               \p{Tagalog}
\p{Gujarati}            \p{Tagbanwa}
\p{Gurmukhi}            \p{TaiLe}
\p{Han}                 \p{Tamil}
\p{Hangul}              \p{Telugu}
\p{Hanunoo}             \p{Thaana}
\p{Hebrew}              \p{Thai}
\p{Hiragana}            \p{Tibetan}
\p{Inherited}           \p{Yi}
\p{Kannada}

A script is a group of code points used by a particular human writing system. Some scripts, such as Thai, correspond with a single human language. Other scripts, such as Latin, span multiple languages. Some languages are composed of multiple scripts. For instance, there is no Japanese Unicode script; instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of.

We listed the Common script first, out of alphabetical order. This script contains all sorts of characters that are common to a wide range of scripts, such as punctuation, whitespace, and miscellaneous symbols.


Unicode grapheme

The difference between code points and characters comes into play when there are combining marks. The Unicode code point U+0061 is “Latin small letter a,” whereas U+00E0 is “Latin small letter a with grave accent.” Both represent what most people would describe as a character.

U+0300 is the “combining grave accent” combining mark. It can be used sensibly only after a letter. A string consisting of the Unicode code points U+0061 U+0300 will be displayed as à, just like U+00E0. The combining mark U+0300 is displayed on top of the character U+0061.
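
A small sketch, assuming Python 3 (where len counts code points), makes the two encodings concrete. The variable names are illustrative.

```python
import unicodedata

precomposed = "\u00E0"        # one code point: a with grave accent
decomposed = "\u0061\u0300"   # two code points: a + combining grave accent

# Both are rendered as the same accented letter, yet they differ as strings
lengths = (len(precomposed), len(decomposed))
same_string = precomposed == decomposed

# U+0300 is a combining mark that takes up no extra space (category Mn)
mark_category = unicodedata.category("\u0300")
```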

The reason for these two different ways of displaying an accented letter is that many historical character sets encode “a with grave accent” as a single character. Unicode’s designers thought it would be useful to have a one-on-one mapping with popular legacy character sets, in addition to the Unicode way of separating marks and base letters, which makes arbitrary combinations not supported by legacy character sets possible.

What matters to you as a regex user is that all regex flavors discussed in this book operate on code points rather than graphical characters. When we say that the regular expression . matches a single character, it really matches just a single code point. If your subject text consists of the two code points U+0061 U+0300, which can be represented as the string literal "\u0061\u0300" in a programming language such as Java, the dot will match only the code point U+0061, or a, without the accent U+0300. The regex .. will match both.

Perl and PCRE offer a special regex token \X, which matches any single Unicode grapheme. Essentially, it is the Unicode version of the venerable dot. It matches any Unicode code point that is not a combining mark, along with all the combining marks that follow it, if any. \P{M}\p{M}* does the same thing using the Unicode property syntax. \X will find two matches in the text àà, regardless of how it is encoded. If it is encoded as "\u00E0\u0061\u0300" the first match is "\u00E0", and the second "\u0061\u0300".
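
Python's re module has neither \X nor \p, so the following is only an approximation of \P{M}\p{M}* written as a loop, assuming Python 3 and the standard unicodedata module. The helpers is_mark and graphemes are hypothetical names introduced for this sketch. It also shows the dot matching one code point at a time.

```python
import re
import unicodedata

def is_mark(ch):
    # True for any of the mark categories Mn, Mc, and Me
    return unicodedata.category(ch).startswith("M")

def graphemes(text):
    # Approximates \P{M}\p{M}*: each non-mark code point starts a cluster,
    # and the marks that follow it are appended to that cluster.
    clusters = []
    for ch in text:
        if not is_mark(ch):
            clusters.append(ch)
        elif clusters:
            clusters[-1] += ch
        # a mark with no base before it is skipped, as the regex would skip it
    return clusters

subject = "\u00E0\u0061\u0300"          # the text from the paragraph above
dot_matches = re.findall(r".", subject)  # one match per code point
clusters = graphemes(subject)            # one entry per grapheme
```

The dot yields three matches for this subject, while the grouping yields the two graphemes described above.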


Variations


Negated variant

The uppercase \P is the negated variant of the lowercase \p. For instance, \P{Sc} matches any character that does not have the “Currency Symbol” Unicode property. \P is supported by all flavors that support \p, and for all the properties, blocks, and scripts that they support.


Character classes

All flavors allow all the \u, \x, \p, and \P tokens they support to be used inside character classes. The character represented by the code point, or the characters in the category, block, or script, are then added to the character class. For instance, you could match a character that is either an opening quote (initial punctuation property), a closing quote (final punctuation property), or the trademark symbol (U+2122) with:

[\p{Pi}\p{Pf}\x{2122}]
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9
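
In a flavor without \p, such as Python's re module, a comparable class can be generated rather than typed. The sketch below assumes Python 3; class_for_categories is a hypothetical helper that scans every code point with unicodedata and lists those in the requested categories. Which characters end up in the class depends on the Unicode version bundled with the Python build.

```python
import re
import sys
import unicodedata

def class_for_categories(categories, extra=""):
    # List every code point whose general category is in `categories`,
    # then append any extra literal characters, and wrap in brackets.
    chars = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    ]
    return "[" + "".join(re.escape(c) for c in chars) + extra + "]"

# Opening quotes (Pi), closing quotes (Pf), and the trademark sign U+2122
quotes_or_tm = re.compile(class_for_categories({"Pi", "Pf"}, extra="\u2122"))

sample = "\u201Chi\u201D \u2122 \"x\""
found = quotes_or_tm.findall(sample)
```

The straight ASCII quotes in the sample are not matched, because they belong to the Po category rather than Pi or Pf.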

Listing all characters

If your regular expression flavor does not support Unicode categories, blocks, or scripts, you can list the characters that are in the category, block, or script in a character class. For blocks this is very easy: each block is simply a range between two code points. The Greek Extended block comprises the characters U+1F00 to U+1FFF:

[\u1F00-\u1FFF]
Regex options: None
Regex flavors: .NET, Java, JavaScript, Python

[\x{1F00}-\x{1FFF}]
Regex options: None
Regex flavors: PCRE, Perl, Ruby 1.9

For most categories and many scripts, the equivalent character class is a long list of individual code points and short ranges. The characters that comprise each category and many of the scripts are scattered throughout the Unicode table. This is the Greek script:

[\u0370-\u0373\u0375\u0376-\u0377\u037A\u037B-\u037D\u0384\u0386↵
\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03F5\u03F6↵
\u03F7-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15↵
\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D↵
\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBD\u1FBE\u1FBF-\u1FC1↵
\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FCD-\u1FCF\u1FD0-\u1FD3\u1FD6-\u1FDB↵
\u1FDD-\u1FDF\u1FE0-\u1FEC\u1FED-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFC↵
\u1FFD-\u1FFE\u2126]
Regex options: None
Regex flavors: .NET, Java, JavaScript, Python

We built this regular expression by copying the listing for the Greek script from http://www.unicode.o...ATA/Scripts.txt, then searching and replacing with three regular expressions:

  1. Searching for the regular expression ;.* and replacing its matches with nothing deletes the comments. If it deletes everything, undo and turn off “dot matches line breaks.”

  2. Searching for ^ with “^ and $ match at line breaks” turned on, and replacing with \u, prefixes the code points with \u. Replacing \.\. with -\u corrects the ranges.

  3. Finally, replacing \s+ with nothing removes the line breaks. Adding the brackets around the character class finishes the regex. You may have to add \u at the start of the character class and/or remove it at the end, depending on whether you included any leading or trailing blank lines when copying the listing from Scripts.txt.
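
The three steps above can be sketched in Python 3 with re.sub. The sample lines are only modeled on the "range ; script # comment" layout of Scripts.txt and are an assumption of this sketch, not a verbatim copy of the file; the variable names are illustrative. The sample is joined without a trailing line break, which sidesteps the stray \u at the end that step 3 warns about.

```python
import re

# Sample lines in the Scripts.txt layout (assumed excerpt for illustration)
listing = "\n".join([
    "0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI",
    "0375          ; Greek # Sk       GREEK LOWER NUMERAL SIGN",
    "0376..0377    ; Greek # L&   [2] GREEK CAPITAL LETTER PAMPHYLIAN DIGAMMA..GREEK SMALL LETTER PAMPHYLIAN DIGAMMA",
])

# Step 1: delete everything from the semicolon to the end of each line
step1 = re.sub(r";.*", "", listing)

# Step 2: prefix each line with \u, then turn ".." into "-\u"
# (r"\\u" in a replacement string inserts a literal backslash and "u")
step2 = re.sub(r"^", r"\\u", step1, flags=re.MULTILINE)
step2 = re.sub(r"\.\.", r"-\\u", step2)

# Step 3: remove all whitespace, including the line breaks
step3 = re.sub(r"\s+", "", step2)

# Add the brackets to finish the character class
char_class = "[" + step3 + "]"
greek = re.compile(char_class)
```

For this sample, char_class comes out as [\u0370-\u0373\u0375\u0376-\u0377], and the compiled pattern matches U+0371 but not U+0374, which the Greek listing skips.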

This may seem like a lot of work, but it actually took Jan less than a minute. Writing the description took much longer. Doing this for the \x{} syntax is just as easy:

  1. Searching for the regular expression ;.* and replacing its matches with nothing deletes the comments. If it deletes everything, undo and turn off “dot matches line breaks.”

  2. Searching for ^ with “^ and $ match at line breaks” turned on and replacing with \x{ prefixes the code points with \x{. Replacing \.\. with }-\x{ corrects the ranges.

  3. Finally, replacing \s+ with } adds the closing braces and removes the line breaks. Adding the brackets around the character class finishes the regex. You may have to add \x{ at the start of the character class and/or remove it at the end, depending on whether you included any leading or trailing blank lines when copying the listing from Scripts.txt.
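
The \x{} variant of the steps can be sketched the same way in Python 3. Python's re module is used here only as the search-and-replace tool: it cannot compile the resulting \x{...} class, which is meant for PCRE, Perl, or Ruby 1.9. The sample lines are again an assumed excerpt in the Scripts.txt layout, and the names are illustrative.

```python
import re

# Sample lines in the Scripts.txt layout (assumed excerpt for illustration)
listing = "\n".join([
    "0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI",
    "0375          ; Greek # Sk       GREEK LOWER NUMERAL SIGN",
    "0376..0377    ; Greek # L&   [2] GREEK CAPITAL LETTER PAMPHYLIAN DIGAMMA..GREEK SMALL LETTER PAMPHYLIAN DIGAMMA",
])

# Step 1: delete the comments (everything from the semicolon onward)
step1 = re.sub(r";.*", "", listing)

# Step 2: prefix each line with \x{, then turn ".." into "}-\x{"
step2 = re.sub(r"^", r"\\x{", step1, flags=re.MULTILINE)
step2 = re.sub(r"\.\.", r"}-\\x{", step2)

# Step 3: each run of whitespace becomes a closing brace; the spaces left
# before each deleted comment supply the brace for the final code point
step3 = re.sub(r"\s+", "}", step2)

char_class_x = "[" + step3 + "]"
```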

The results are:

[\x{0370}-\x{0373}\x{0375}\x{0376}-\x{0377}\x{037A}\x{037B}-\x{037D}↵
\x{0384}\x{0386}\x{0388}-\x{038A}\x{038C}\x{038E}-\x{03A1}↵
\x{03A3}-\x{03E1}\x{03F0}-\x{03F5}\x{03F6}\x{03F7}-\x{03FF}↵
\x{1D26}-\x{1D2A}\x{1D5D}-\x{1D61}\x{1D66}-\x{1D6A}\x{1DBF}↵
\x{1F00}-\x{1F15}\x{1F18}-\x{1F1D}\x{1F20}-\x{1F45}\x{1F48}-\x{1F4D}↵
\x{1F50}-\x{1F57}\x{1F59}\x{1F5B}\x{1F5D}\x{1F5F}-\x{1F7D}↵
\x{1F80}-\x{1FB4}\x{1FB6}-\x{1FBC}\x{1FBD}\x{1FBE}\x{1FBF}-\x{1FC1}↵
\x{1FC2}-\x{1FC4}\x{1FC6}-\x{1FCC}\x{1FCD}-\x{1FCF}\x{1FD0}-\x{1FD3}↵
\x{1FD6}-\x{1FDB}\x{1FDD}-\x{1FDF}\x{1FE0}-\x{1FEC}\x{1FED}-\x{1FEF}↵
\x{1FF2}-\x{1FF4}\x{1FF6}-\x{1FFC}\x{1FFD}-\x{1FFE}\x{2126}↵
\x{10140}-\x{10174}\x{10175}-\x{10178}\x{10179}-\x{10189}↵
\x{1018A}\x{1D200}-\x{1D241}\x{1D242}-\x{1D244}\x{1D245}]
Regex options: None
Regex flavors: PCRE, Perl, Ruby 1.9

Learn more about this topic from Regular Expressions Cookbook.
