Syntax and Usage

A regular expression (or "regex") is a character sequence used to search for patterns within strings. You've already seen examples of pattern searching when we looked at strings. Regular expressions have their own syntax, which enables more general and flexible constructs to search with.

The "re" module in Python is the tool that will be used to build regular expressions in this unit. Practice these examples to familiarize yourself with some common methods. You will also practice building more general regular expression patterns using the table of special characters.

Predefined Character Classes

In addition to the dot, there are more predefined classes of characters available in Python for cases that commonly appear in regular expressions. For instance, these can be used to match any digit or any non-digit. Predefined classes are denoted by a backslash followed by a particular character, like \d for a single decimal digit, so the characters 0 to 9. The following table lists the most important predefined classes:


Predefined Character Classes

Predefined class    Description
\d                  any decimal digit (0 to 9)
\D                  any character that is not a digit
\s                  any whitespace character (including the space, tab, and newline characters)
\S                  any non-whitespace character
\w                  any alphanumeric character (Latin letters a-z and A-Z, Arabic digits 0 to 9, and the underscore character)
\W                  any non-alphanumeric character

Note that for ordinary (str) patterns in Python 3, \d and \w also match digits and letters from other scripts unless the re.ASCII flag is set.
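The classes in the table can be tried out directly. The following is a minimal sketch that uses re.findall(...) on a made-up sample string (the string is an assumption for illustration, not part of the lesson data):

```python
import re

# Made-up sample string containing letters, digits, punctuation, a tab, and spaces
sample = "Room 42,\tfloor_3 B!"

digits = re.findall(r"\d", sample)            # single decimal digits
whitespace = re.findall(r"\s", sample)        # single whitespace characters
non_space_runs = re.findall(r"\S+", sample)   # runs of non-whitespace characters
word_runs = re.findall(r"\w+", sample)        # runs of letters, digits, and underscores
non_word = re.findall(r"\W", sample)          # everything that is not a \w character

print(digits)          # ['4', '2', '3']
print(non_space_runs)  # ['Room', '42,', 'floor_3', 'B!']
print(word_runs)       # ['Room', '42', 'floor_3', 'B']
```

The patterns are written as raw strings (r"...") so that the backslash reaches the re module unchanged.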

To give one example, the following pattern finds all names in which "John" appears not as a separate word but as the beginning of a longer name (either first or last name). In other words, "John" must be followed by at least one non-whitespace character, which is what the \S in the regular expression expresses. The only name in our list that matches this pattern is "Jennifer Johnson".


# raw string (r"...") so that \S is passed to the re module unchanged
pattern = r".*John\S"
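To see the pattern in action, here is a minimal sketch that applies it with match(...) to a list of names. The list below is a hypothetical stand-in assembled from the names mentioned in this section; the lesson's actual personList may differ.

```python
import re

# Hypothetical stand-in for the lesson's personList (names taken from this section)
personList = [
    "Julia Smith", "John Williams", "Jennifer Johnson", "Papa John",
    "Walter John Miller", "Rebecca Clark", "Kermit the Frog",
    "Vincent van Gogh", "Michael Mason", "Susanne Walker",
    "Dr. Melissa Franklin", "Dr. Dr. Matthew Malone",
]

compiledRE = re.compile(r".*John\S")

# match(...) returns None when the pattern does not match from the start of the string
matches = [person for person in personList if compiledRE.match(person)]
print(matches)  # ['Jennifer Johnson']
```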

In addition to the *, there are more special characters for denoting certain cases of repetitions of a character or a group. + stands for arbitrarily many occurrences but, in contrast to *, the character or group needs to occur at least once. ? stands for zero or one occurrence of the character or group. That means it is used when a character or sequence of characters is optional in a pattern. Finally, the most general form {m,n} says that the previous character or group needs to occur at least m times and at most n times.

If we use ".+John" instead of ".*John" in an earlier example, we will only get the names that contain "John" but preceded by one or more other characters.

pattern = ".+John"

Output:


Jennifer Johnson
Papa John
Walter John Miller

By writing ...

pattern = ".{11,11}[A-Z]"

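... we get all names that have an upper-case character as the 12th character. The result will be "Kermit the Frog". This is a bit easier and less error-prone than writing "...........[A-Z]". Since the lower and upper bound are the same here, the pattern can also be shortened to ".{11}[A-Z]".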
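As a quick check, the sketch below compares the {11,11} form with the shorter {11} form on two names from this section; both forms should behave identically.

```python
import re

long_form = re.compile(r".{11,11}[A-Z]")
short_form = re.compile(r".{11}[A-Z]")   # {m} is shorthand for {m,m}

frog = "Kermit the Frog"     # 12th character is 'F'
gogh = "Vincent van Gogh"    # 12th character is a space

frog_long = long_form.match(frog)
frog_short = short_form.match(frog)
gogh_long = long_form.match(gogh)

print(frog_long.group())   # Kermit the F
print(frog_short.group())  # Kermit the F
print(gogh_long)           # None
```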

Lastly, the pattern ".*li?a" can be used to get all names that contain the character sequences 'la' or 'lia'.

pattern = ".*li?a"

Output:


Julia Smith
John Williams
Rebecca Clark

So far we have mostly applied the repetition operators *, +, {m,n}, and ? to a single specific character or to the dot. When used after a class, these operators stand for a certain number of occurrences of characters from that class. For instance, the following pattern can be used to search for names that contain a word consisting only of lower-case letters (a-z), like "Kermit the Frog" and "Vincent van Gogh". We use \s to represent the required whitespace before and after the word, and [a-z]+ for an arbitrarily long sequence of lower-case letters that contains at least one letter.

pattern = r".*\s[a-z]+\s"
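The following sketch runs this pattern against a small sample of names from this section (a subset chosen for illustration, not the full lesson list):

```python
import re

# Small sample of names mentioned in this section
names = ["Kermit the Frog", "Vincent van Gogh", "Walter John Miller", "Julia Smith"]

lower_word = re.compile(r".*\s[a-z]+\s")

matches = [name for name in names if lower_word.match(name)]
print(matches)  # ['Kermit the Frog', 'Vincent van Gogh']

# The match ends with the whitespace that follows the lower-case word
matched_part = lower_word.match("Kermit the Frog").group()
print(repr(matched_part))  # 'Kermit the '
```

"Walter John Miller" is not matched because its middle word starts with an upper-case letter, which [a-z]+ does not accept.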

Sequences of characters can be grouped together with the help of parentheses (...) and then be followed by a repetition operator, which then applies to the whole group. For instance, the following pattern can be used to get all names where the first name starts with the letter 'M', taking into account that names may have a 'Dr. ' prefix. In the pattern, we use the group (Dr.\s) followed by the ? operator to say that the name can start with that group but doesn't have to. Then we have the upper-case M followed by .*\s to make sure there is a whitespace character later in the string, so that we can be reasonably sure this is the first name. Note that the unescaped dot in (Dr.\s) matches any character, not just a period; writing Dr\. instead would restrict the match to a literal period.

pattern = r"(Dr.\s)?M.*\s"

Output:


Michael Mason
Dr. Melissa Franklin

You may have noticed that there is a person with two doctor titles in the list whose first name also starts with an 'M' and that it is currently not captured by the pattern because the ? operator will match at most one occurrence of the group. By changing the ? to a * , we can match an arbitrary number of doctor titles.

pattern = r"(Dr.\s)*M.*\s"

Output:


Michael Mason
Dr. Melissa Franklin
Dr. Dr. Matthew Malone
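The difference between the two group patterns can be sketched as follows, using a small sample of names from this section. The last two lines use a made-up input ("Drx Mary Poe", purely hypothetical) to illustrate that the unescaped dot in Dr. accepts any character, whereas Dr\. accepts only a period.

```python
import re

# Small sample of names mentioned in this section
names = ["Michael Mason", "Dr. Melissa Franklin", "Dr. Dr. Matthew Malone", "Julia Smith"]

optional_title = re.compile(r"(Dr.\s)?M.*\s")   # zero or one title
repeated_title = re.compile(r"(Dr.\s)*M.*\s")   # any number of titles

with_optional = [n for n in names if optional_title.match(n)]
with_repeated = [n for n in names if repeated_title.match(n)]

print(with_optional)  # ['Michael Mason', 'Dr. Melissa Franklin']
print(with_repeated)  # ['Michael Mason', 'Dr. Melissa Franklin', 'Dr. Dr. Matthew Malone']

# Hypothetical input: 'Drx ' is accepted as a title by the unescaped dot
loose = re.match(r"(Dr.\s)?M.*\s", "Drx Mary Poe")
strict = re.match(r"(Dr\.\s)?M.*\s", "Drx Mary Poe")
print(loose is not None, strict is not None)  # True False
```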

Similar to how the if-else statement lets us make case distinctions in addition to loop-based repetition in normal Python, regular expressions can use the | character to define alternatives. For instance, (nn|ss) can be used to get all names that contain either the sequence "nn" or the sequence "ss" (or both):

pattern = ".*(nn|ss)"

Output:


Jennifer Johnson
Susanne Walker
Dr. Melissa Franklin
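A minimal sketch of the alternation, again on a small sample of names from this section. It additionally uses group(1), which returns the text matched by the first parenthesized group, to show which alternative was found; that call is an addition for illustration and is not used elsewhere in this section.

```python
import re

# Small sample of names mentioned in this section
names = ["Jennifer Johnson", "Susanne Walker", "Dr. Melissa Franklin", "Rebecca Clark"]

double_letters = re.compile(r".*(nn|ss)")

matches = [n for n in names if double_letters.match(n)]
print(matches)  # ['Jennifer Johnson', 'Susanne Walker', 'Dr. Melissa Franklin']

# group(1) holds whichever alternative inside the parentheses matched
found = {n: double_letters.match(n).group(1) for n in matches}
print(found["Dr. Melissa Franklin"])  # ss
```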

As we already mentioned, ^ and $ represent the beginning and end of a string, respectively. Let's say we want to get all names from the list that end in "John". This can be done using the following regular expression:

pattern = ".*John$"

Output:


Papa John

Here is a more complicated example. We want all names that contain "John" as a single word, independent of whether "John" appears at the beginning, somewhere in the middle, or at the end of the name. However, we want to exclude cases where "John" appears as part of a longer word (like "Johnson"). A first idea could be to use ".*\sJohn\s", making sure that there are whitespace characters before and after "John". However, this will match neither "John Williams" nor "Papa John" because the beginning and end of the string are not whitespace characters. What we can do is use the pattern "(^|.*\s)John" to say that John needs to be preceded either by the beginning of the string or by an arbitrary sequence of characters followed by a whitespace. Similarly, "John(\s|$)" requires that John is followed either by a whitespace or by the end of the string. Taken together, we get the following regular expression:

pattern = r"(^|.*\s)John(\s|$)"

Output:


John Williams
Papa John
Walter John Miller

An alternative would be to use the regular expression "(.*\s)?John(\s.*)?$", which uses the optional operator ? rather than | . There are often several ways to express the same thing in a regular expression. Also, as you start to see here, the different special matching operators can be combined and nested to form arbitrarily complex regular expressions. You will practice writing regular expressions like this a bit more in the practice exercises and in the homework assignment.
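The two formulations can be compared side by side. This is a small sketch on four names from this section; on this sample, both patterns should select the same names.

```python
import re

# Small sample of names mentioned in this section
names = ["John Williams", "Papa John", "Walter John Miller", "Jennifer Johnson"]

with_alternation = re.compile(r"(^|.*\s)John(\s|$)")
with_optional = re.compile(r"(.*\s)?John(\s.*)?$")

result_alternation = [n for n in names if with_alternation.match(n)]
result_optional = [n for n in names if with_optional.match(n)]

print(result_alternation)  # ['John Williams', 'Papa John', 'Walter John Miller']
print(result_alternation == result_optional)  # True
```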

In addition to the main special characters explained in this section, there are certain extension operators of the form (?x...), where the x can be one of several special characters that determine the meaning of the operator. Here we only briefly mention the operator (?!...) for negative lookahead assertions because we will use it later in the lesson's walkthrough to filter files in a folder. A negative lookahead means that what comes before the (?!...) can only be matched if it is not followed by the expression given for the ... . For instance, if we want to find all names that contain "John" not followed by "son" as in "Johnson", we could use the following expression:

pattern = ".*John(?!son)"

Output:


John Williams
Papa John
Walter John Miller
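The sketch below applies the negative lookahead to four names from this section. It also shows that the lookahead only inspects the text that follows and does not become part of the matched string.

```python
import re

# Small sample of names mentioned in this section
names = ["John Williams", "Papa John", "Walter John Miller", "Jennifer Johnson"]

not_johnson = re.compile(r".*John(?!son)")

kept = [n for n in names if not_johnson.match(n)]
print(kept)  # ['John Williams', 'Papa John', 'Walter John Miller']

# The lookahead checks what follows "John" but consumes no characters
matched_text = not_johnson.match("Walter John Miller").group()
print(matched_text)  # Walter John
```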

If match(...) does not find a match, it returns the special value None. That's why we can use it in an if-statement, as we have been doing in all the previous examples. However, if a match is found, it does not simply return True but a match object that can be used to get further information, for instance about which part of the string matched the pattern. The match object provides the methods group() for getting the matched part as a string, start() for getting the character index at which the match starts, end() for getting the index just past the last matched character, and span() for getting both start and end indices as a tuple. The example below shows how one would use the returned match object to get further information, followed by the output produced by its four methods for the pattern "John" matching the string "John Williams":

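import re

# personList is the list of names used in the previous examples
pattern = "John"
compiledRE = re.compile(pattern)

for person in personList:
    match = compiledRE.match(person)
    if match:
        print(match.group())  # the matched substring
        print(match.start())  # index at which the match starts
        print(match.end())    # index just past the end of the match
        print(match.span())   # (start, end) as a tuple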

Output:


John <- output of group()
0 <- output of start()
4 <- output of end()
(0, 4) <- output of span()

In addition to match(...), there are three more matching functions defined in the re module. Like match(...), these all exist as standalone functions taking a regular expression and a string as parameters, and as methods to be invoked for a compiled pattern. Here is a brief overview:

  • search(...) - In contrast to match(...), search(...) tries to find matching locations anywhere within the string not just matches starting at the beginning. That means "^John" used with search(...) corresponds to "John" used with match(...), and ".*John" used with match(...) corresponds to "John" used with search(...). However, "corresponds" here only means that a match will be found in exactly the same cases but the output by the different methods of the returned matching object will still vary.
  • findall(...) - In contrast to match(...) and search(...), findall(...) will identify all non-overlapping substrings in the given string that match the regular expression and return these matches as a list of strings.
  • finditer(...) - finditer(...) works like findall(...) but returns the matches found not as a list but as a so-called iterator object that produces one match object per match.
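The differences between the four functions can be sketched on a single made-up sentence (the text below is an assumption for illustration). The example uses both the standalone functions and the methods of a compiled pattern.

```python
import re

# Made-up sample text containing "John" twice, neither at the start
text = "Walter John Miller met John Williams"

at_start = re.match(r"John", text)        # None: text does not start with "John"
first = re.search(r"John", text)          # first occurrence anywhere in the text
greedy = re.match(r".*John", text)        # also succeeds, but matches a longer part

compiledRE = re.compile(r"John")
all_johns = compiledRE.findall(text)                    # list of matched strings
spans = [m.span() for m in compiledRE.finditer(text)]   # one match object per match

print(at_start)        # None
print(first.span())    # (7, 11)
print(greedy.group())  # Walter John Miller met John
print(all_johns)       # ['John', 'John']
print(spans)           # [(7, 11), (23, 27)]
```

Note how search(...) with "John" and match(...) with ".*John" both succeed on this text but report different matched parts, which is the sense in which the two only "correspond".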

By now you should have enough understanding of regular expressions to cover roughly 80 to 90% of the cases that you encounter in typical programming. However, there are quite a few additional aspects and details that we did not cover here and that you may need when dealing with more sophisticated cases of regular-expression-based matching. The full documentation of the "re" module can be found in the official Python documentation and is always a good source for looking up details when needed. In addition, the Regular Expression HOWTO in the Python documentation provides a good overview.

We also want to mention that regular expressions are very common in programming and matching with them is very efficient, but they do have certain limitations in their expressivity. For instance, the operators covered in this section offer no compact way to write a regular expression for names in which the first and last name start with the same character (the re module handles such cases with backreferences like \1, an extension we did not cover). Likewise, you cannot define a regular expression pattern for all strings that are palindromes, that is, words that read the same forward and backward. For these kinds of patterns, certain extensions to the concept of a regular expression are needed. One such generalization is what are called recursive regular expressions. The third-party regex Python package, which is designed to be backward compatible with re, has this capability, so feel free to check it out if you are interested in this topic.