Delving Deeper
Regular Expressions
Unicode
Under Python 3, str objects use the full Unicode character set, and regular expression processing on a str assumes that the pattern and input text are both Unicode. The escape codes described earlier are defined in terms of Unicode by default. Those assumptions mean that the pattern \w+ will match both the words "French" and "Français". To restrict escape codes to the ASCII character set, as was the default in Python 2, use the ASCII flag when compiling the pattern or when calling the module-level functions search() and match().
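The module-level form mentioned above can be sketched as follows. This is a minimal illustration, not part of the original listing; the variable names are hypothetical, and the sample text is borrowed from the example script in this section.

```python
import re

text = 'Français złoty Österreich'

# match() anchors at the start of the text. Without a flag, \w uses
# the Unicode definition of a word character.
unicode_match = re.match(r'\w+', text)

# Passing re.ASCII through the flags argument restricts \w to
# [a-zA-Z0-9_], so the match stops before the first non-ASCII letter.
ascii_match = re.match(r'\w+', text, flags=re.ASCII)

# search() accepts the same flag. Here the pattern is anchored to the
# end of the text to show the effect on the last word.
unicode_search = re.search(r'\w+$', text)
ascii_search = re.search(r'\w+$', text, flags=re.ASCII)

print('match, Unicode  :', unicode_match.group(0))
print('match, ASCII    :', ascii_match.group(0))
print('search, Unicode :', unicode_search.group(0))
print('search, ASCII   :', ascii_search.group(0))
```

Compiling the pattern once, as in the listing for this section, is usually preferable when the same expression is applied many times. The module-level calls are convenient for one-off checks.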
# re_flags_ascii.py
import re

text = u'Français złoty Österreich'
pattern = r'\w+'
ascii_pattern = re.compile(pattern, re.ASCII)
unicode_pattern = re.compile(pattern)

print('Text :', text)
print('Pattern :', pattern)
print('ASCII :', list(ascii_pattern.findall(text)))
print('Unicode :', list(unicode_pattern.findall(text)))
The other escape sequences (\W, \b, \B, \d, \D, \s, and \S) are also processed differently when the ASCII flag is set. Instead of consulting the Unicode database to find the properties of each character, re uses the ASCII definition of the character set.
$ python3 re_flags_ascii.py

Text : Français złoty Österreich
Pattern : \w+
ASCII : ['Fran', 'ais', 'z', 'oty', 'sterreich']
Unicode : ['Français', 'złoty', 'Österreich']
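The effect of the ASCII flag on the other escape sequences can be sketched in the same way. The sample strings below are assumptions chosen for illustration (Arabic-Indic digits, a no-break space, and a word containing a non-ASCII letter); they do not come from the original example.

```python
import re

# Hypothetical sample data, written with escapes so the characters
# are unambiguous in the source.
digits = '2023 \u0662\u0660\u0662\u0663'   # ASCII digits, then Arabic-Indic digits
spaced = 'a\u00a0b'                        # 'a', NO-BREAK SPACE, 'b'
word = '\u00e9t\u00e9'                     # 'été'

# \d: Unicode mode accepts any decimal digit, ASCII mode only 0-9.
unicode_digits = re.findall(r'\d+', digits)
ascii_digits = re.findall(r'\d+', digits, re.ASCII)

# \s: Unicode mode treats the no-break space as whitespace, ASCII mode does not.
unicode_split = re.split(r'\s+', spaced)
ascii_split = re.split(r'\s+', spaced, flags=re.ASCII)

# \b: in ASCII mode 'é' is not a word character, so a word boundary
# appears before the 't' in the middle of the word.
unicode_boundary = re.search(r'\bt', word)
ascii_boundary = re.search(r'\bt', word, re.ASCII)

print('digits, Unicode   :', unicode_digits)
print('digits, ASCII     :', ascii_digits)
print('split, Unicode    :', unicode_split)
print('split, ASCII      :', ascii_split)
print('boundary, Unicode :', unicode_boundary is not None)
print('boundary, ASCII   :', ascii_boundary is not None)
```

The \b case is worth noting because it can produce matches in ASCII mode that do not exist in Unicode mode, which may be surprising when a pattern is applied to text containing accented letters.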