The subject of regular expressions is quite deep, and it takes an immense amount of practice to get used to the special character syntax. Furthermore, the re module contains a vast set of methods available for performing searches using regular expressions. Upon completing the examples in this section, you should have a much deeper appreciation for how powerful regular expressions can be.
Looking Ahead or Behind
In many cases, it is useful to match a part of a pattern only if some other part will also match. For example, in the email parsing expression, the angle brackets were marked as optional. Realistically, the brackets should be paired, and the expression should match only if both are present, or neither is. This modified version of the expression uses a positive look ahead assertion to match the pair. The look ahead assertion syntax is (?=pattern).
# re_look_ahead.py
import re

address = re.compile(
    r'''
    # A name is made up of letters, and may include "."
    # for title abbreviations and middle initials.
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+
     )
     \s+
    ) # name is no longer optional

    # LOOKAHEAD
    # Email addresses are wrapped in angle brackets, but only
    # if both are present or neither is.
    (?= (<.*>$)       # remainder wrapped in angle brackets
        |
        ([^<].*[^>]$) # remainder *not* wrapped in angle brackets
      )

    <? # optional opening angle bracket

    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )

    >? # optional closing angle bracket
    ''',
    re.VERBOSE)

candidates = [
    u'First Last <firstname.lastname@example.org>',
    u'No Brackets email@example.com',
    u'Open Bracket <email@example.com',
]

for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Name :', match.groupdict()['name'])
        print(' Email:', match.groupdict()['email'])
    else:
        print(' No match')
There are several important changes in this version of the expression. First, the name portion is no longer optional. That means stand-alone addresses do not match, but it also prevents improperly formatted name/address combinations from matching. The positive look ahead rule after the "name" group asserts that either the remainder of the string is wrapped with a pair of angle brackets, or there is not a mismatched bracket; either both of or neither of the brackets is present. The look ahead is expressed as a group, but the match for a look ahead group does not consume any of the input text, so the rest of the pattern picks up from the same spot after the look ahead matches.
$ python3 re_look_ahead.py

Candidate: First Last <firstname.lastname@example.org>
 Name : First Last
 Email: firstname.lastname@example.org
Candidate: No Brackets email@example.com
 Name : No Brackets
 Email: email@example.com
Candidate: Open Bracket <email@example.com
 No match
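The point that a look ahead group does not consume input can be seen in isolation with a much smaller pattern. The following is only a minimal sketch: the pattern foo(?=bar) and the sample strings are hypothetical and are not part of the email example.

```python
import re

# Hypothetical pattern for illustration: match "foo" only when "bar" follows.
pattern = re.compile(r'foo(?=bar)')

match = pattern.search('foobar')

# The look ahead checked for "bar" but did not consume it, so the
# match covers only "foo" and ends at index 3.
print(match.group(0))   # foo
print(match.span())     # (0, 3)

# When "bar" does not follow, "foo" does not match at all.
print(pattern.search('foobaz'))  # None
```

Because the match ends before "bar", anything placed after the look ahead in a larger pattern would start matching at that same position.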
A negative look ahead assertion ((?!pattern)) says that the pattern does not match the text following the current point. For example, the email recognition pattern could be modified to ignore the noreply mailing addresses commonly used by automated systems.
# re_negative_look_ahead.py
import re

address = re.compile(
    r'''
    ^

    # An address: email@example.com

    # Ignore noreply addresses
    (?!noreply@.*$)

    [\w\d.+-]+       # username
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)    # limit the allowed top-level domains

    $
    ''',
    re.VERBOSE)

candidates = [
    u'firstname.lastname@example.org',
    u'noreply@example.com',
]

for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Match:', candidate[match.start():match.end()])
    else:
        print(' No match')
The address starting with noreply does not match the pattern, since the look ahead assertion fails.
$ python3 re_negative_look_ahead.py

Candidate: firstname.lastname@example.org
 Match: firstname.lastname@example.org
Candidate: noreply@example.com
 No match
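A negative look ahead only tests the text at the point where it appears in the pattern. The sketch below illustrates this with a hypothetical rule (reject names that start with "test") and made-up sample names. It is not taken from the email example.

```python
import re

# Hypothetical rule: accept a simple word unless it starts with "test".
not_test = re.compile(r'^(?!test)\w+$')

names = ['alice', 'test_user', 'contest']

# "contest" is accepted because the assertion is checked only at the
# start of the string, where the next four characters are "cont".
accepted = [n for n in names if not_test.search(n)]
print(accepted)  # ['alice', 'contest']
```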
Instead of looking ahead for noreply in the username portion of the email address, the pattern can alternatively be written using a negative look behind assertion after the username is matched using the syntax (?<!pattern).
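# re_negative_look_behind.py
import re

address = re.compile(
    r'''
    ^

    # An address: email@example.com

    [\w\d.+-]+       # username

    # Ignore noreply addresses
    (?<!noreply)

    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)    # limit the allowed top-level domains

    $
    ''',
    re.VERBOSE)

candidates = [
    u'firstname.lastname@example.org',
    u'noreply@example.com',
]

for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Match:', candidate[match.start():match.end()])
    else:
        print(' No match')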
Looking backward works a little differently than looking ahead, in that the expression must use a fixed-length pattern. Repetitions are allowed, as long as there is a fixed number of them (no wildcards or ranges).
$ python3 re_negative_look_behind.py

Candidate: firstname.lastname@example.org
 Match: firstname.lastname@example.org
Candidate: noreply@example.com
 No match
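The fixed-length restriction on look behind patterns can be checked by compiling a few variations. This is a small sketch only: the helper name compiles() and the digit patterns are invented for illustration.

```python
import re


def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# A fixed repetition count has a known width, so it is accepted.
print(compiles(r'(?<=\d{3})-'))    # True

# Open-ended repetition has no fixed width, so it is rejected.
print(compiles(r'(?<=\d+)-'))      # False

# A range of repetitions is rejected for the same reason.
print(compiles(r'(?<=\d{1,3})-'))  # False
```

The rejected patterns fail when they are compiled, before any text is searched.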
A positive look behind assertion can be used to find text following a pattern using the syntax (?<=pattern). In the following example, the expression finds Twitter handles.
# re_look_behind.py
import re

twitter = re.compile(
    r'''
    # A twitter handle: @username
    (?<=@)
    ([\w\d_]+)       # username
    ''',
    re.VERBOSE)

text = '''This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.
'''

print(text)
for match in twitter.findall(text):
    print('Handle:', match)
The pattern matches sequences of characters that can make up a Twitter handle, as long as they are preceded by an @.
$ python3 re_look_behind.py

This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.

Handle: ThePSF
Handle: doughellmann
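To confirm that the @ is required but is not part of the match, the match positions can be inspected with finditer(). The sketch below assumes a shortened sample string and a simplified pattern, (?<=@)\w+, in place of the verbose expression above.

```python
import re

# Shortened sample text, assumed for illustration.
text = 'Contact @ThePSF or @doughellmann.'

# Simplified pattern: word characters that are preceded by "@".
handles = re.compile(r'(?<=@)\w+')

found = []
for m in handles.finditer(text):
    # The character just before each match is the "@" that the look
    # behind required, and it lies outside the matched span.
    found.append((m.group(0), text[m.start() - 1]))
    print(m.group(0), 'preceded by', text[m.start() - 1])
```

Each reported handle starts one character after its @, so no capturing group is needed to strip the prefix.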