Syntax and Usage

Regular Expressions

To start off Lesson 3, we want to talk about a situation that you regularly encounter in programming: you need to find one string, or all strings, matching a particular pattern in a given set of strings.

For instance, you may have a list of names of persons and need all names from that list whose last name starts with the letter 'J'. Or, you want to do something with all files in a folder whose names contain the digit sequence "154" and that have the file extension ".shp". Or, in a longer text, you want to find all occurrences where the word "red" is followed by the word "green" with at most two words in between.

Most programming languages support these kinds of matching tasks through a notation for string patterns called regular expressions.

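A regular expression is a string in which certain characters like '.', '*', '(', ')', etc. and certain combinations of characters are given special meanings to represent other characters and sequences of other characters. You have surely already seen the related wildcard notation "*.txt", which stands for all files with arbitrary names ending in ".txt". Note that such file wildcards follow different rules than regular expressions: as a regular expression, the same set of names would be written roughly as ".*\.txt".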

To give you another example before we approach this topic more systematically, the regular expression "a.*b" in Python stands for all strings that start with the character 'a', followed by an arbitrary sequence of characters, followed by a 'b'. The dot represents any single character (by default, any character except a line break), and the star stands for an arbitrary number of repetitions, including zero, of whatever precedes it. Therefore, this pattern would, for instance, match the strings 'acb', 'acdb', 'acdbb', and also 'ab'.
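As a quick check of the "a.*b" example, the following minimal sketch tests a few strings against the pattern. It uses re.fullmatch(...), which is not discussed in this section and is chosen here only because it requires the entire string to fit the pattern. The strings after the first three are extra illustrative cases that are not taken from the text.

```python
import re

# The first three strings come from the text; the rest are illustrative extras.
candidates = ['acb', 'acdb', 'acdbb', 'ab', 'xab', 'abc', 'a']

pattern = "a.*b"

# re.fullmatch(...) succeeds only if the whole string fits the pattern.
matched = [s for s in candidates if re.fullmatch(pattern, s)]

print(matched)  # ['acb', 'acdb', 'acdbb', 'ab']
```

The strings 'xab', 'abc', and 'a' are rejected because they do not start with 'a', do not end with 'b', or contain no 'b' at all, respectively.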

Regular expressions like these can be used in functions provided by the programming language that, for instance, compare the expression to another string and then determine whether that string matches the pattern from the regular expression or not. Using such a function and applying it to, for example, a list of person names or file names allows us to perform some task only with those items from the list that match the given pattern.

In Python, the module from the standard library that provides support for regular expressions, together with the functions for working with them, is simply called "re". Its function for comparing a regular expression to another string is called match(...). It returns a match object if the string matches the expression and None otherwise, so its result can be used directly as the condition of an if-statement. Let's create a small example to learn how to write regular expressions. In this example, we have a list of names in a variable called personList. We loop through this list, compare each name to a regular expression given in the variable pattern, and print out the name if it matches the pattern.

import re

# Names to test against the pattern
personList = [ 'Julia Smith', 'Francis Drake', 'Michael Mason',
                'Jennifer Johnson', 'John Williams', 'Susanne Walker',
                'Kermit the Frog', 'Dr. Melissa Franklin', 'Papa John',
                'Walter John Miller', 'Frank Michael Robertson', 'Richard Robertson',
                'Erik D. White', 'Vincent van Gogh', 'Dr. Dr. Matthew Malone',
                'Rebecca Clark' ]

# Regular expression to look for
pattern = "John"

# Print every name that matches the pattern
for person in personList:
    if re.match(pattern, person):
        print(person)

Output:

John Williams
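To make the result of match(...) more concrete, here is a small sketch that inspects what the function returns for one matching and one non-matching name from the list. The variable names result and no_result are introduced here only for illustration.

```python
import re

# A successful match returns a match object describing what was matched.
result = re.match("John", "John Williams")
print(result.group())  # John
print(result.span())   # (0, 4)

# An unsuccessful match returns None, which counts as False in an if-statement.
no_result = re.match("John", "Papa John")
print(no_result)       # None
```

Because a match object counts as true and None counts as false, the loop above can use the return value of re.match(...) directly in its if-statement.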

Before we try out different regular expressions with the code above, we want to mention that the part of the code following the name list is better written in the following way:

pattern = "John"

compiledRE = re.compile(pattern)   # compile the pattern once, before the loop

for person in personList:
    if compiledRE.match(person):   # reuse the compiled pattern for every name
        print(person)

Whenever we call a function from the "re" module like match(...) and provide the regular expression as a parameter, the function first preprocesses the regular expression and compiles it into a data structure that allows strings to be matched against the pattern efficiently. If we want to match several strings against the same pattern, as we do in the for-loop here, it is cleaner and usually somewhat more time-efficient to perform this preprocessing explicitly, store the compiled pattern in a variable, and then invoke the match(...) method of that compiled pattern. (Python keeps a cache of recently compiled patterns, so the module-level functions do not literally recompile the pattern on every call, but they still have to look it up each time.) Compiling is also a convenient place to supply additional parameters, for example when you want the matching to be case-insensitive.

In the code above, compiling the pattern happens in the call of the re.compile(...) function, and the compiled pattern is stored in the variable compiledRE. Instead of the module-level match(...) function, we now invoke the method match(...) of the compiled pattern object in the variable compiledRE inside the loop. This method needs only one parameter, the string that should be matched against the pattern. With this approach, the pattern is explicitly compiled exactly once, before the loop starts, rather than being handed to re.match(...) again for each name in the list as in the first version.
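As an illustration of passing an additional parameter when compiling, the following sketch uses the re.IGNORECASE flag to make the matching case-insensitive. The list names and its contents are hypothetical sample data, not part of the lesson's name list.

```python
import re

# Hypothetical sample data with mixed capitalization
names = ['john williams', 'JOHN SMITH', 'Papa John', 'Johnny Cash']

# The second argument of re.compile(...) takes flags such as re.IGNORECASE.
compiledRE = re.compile("John", re.IGNORECASE)

# match(...) still only looks at the beginning of each string.
matches = [n for n in names if compiledRE.match(n)]

print(matches)  # ['john williams', 'JOHN SMITH', 'Johnny Cash']
```

'Papa John' is still excluded because the flag changes only how letters are compared, not where in the string the matching starts.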

One important thing to know about match(...) is that it always tries to match the pattern at the beginning of the given string, but it allows the string to contain additional characters after the entire pattern has been matched. That is why, when running the code above, the simple regular expression "John" matches "John Williams" but none of "Jennifer Johnson", "Papa John", or "Walter John Miller". You may wonder how you would then ever write a pattern that only matches strings that end in a certain sequence of characters. The answer is that Python's regular expressions use the special characters ^ and $ to represent the beginning and the end of a string, respectively, and this allows us to deal with such situations, as we will see a bit further below.
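As a brief preview, this behavior can be sketched with four names from the list above. The patterns ".*John" and ".*John$" are illustrative choices that combine the dot and star from the earlier "a.*b" example with the $ character. The variable names are introduced here only for this sketch.

```python
import re

personList = ['John Williams', 'Jennifer Johnson', 'Papa John', 'Walter John Miller']

# match(...) anchors at the beginning: only names starting with "John" pass.
starts_with_john = [p for p in personList if re.match("John", p)]

# A leading ".*" lets arbitrary characters come first, so "John" may occur anywhere.
contains_john = [p for p in personList if re.match(".*John", p)]

# Adding "$" requires the string to end right after "John".
ends_with_john = [p for p in personList if re.match(".*John$", p)]

print(starts_with_john)  # ['John Williams']
print(contains_john)     # all four names
print(ends_with_john)    # ['Papa John']
```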



Source: James O'Brien and John A. Dutton, https://www.e-education.psu.edu/geog489/node/2264
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.