CS105 Study Guide

Unit 8: Regular Expressions

8a. Explain why and how regular expressions are used

  • What is a regular expression and what is the re module?
  • What is the compile method used for?
  • Name some special characters useful for forming regular expressions.

A regular expression is a string of characters that describes a pattern to look for in other strings. Once constructed, the pattern can be used to search other strings. The re module provides a number of methods for applying regular expressions in Python. Within this module, the compile method is useful because it converts a regular expression pattern into a compiled pattern object that can be reused efficiently across many searches. Other important methods include match and search.
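As a minimal sketch of how compile fits in, the example below compiles a pattern once and reuses it on several strings. The pattern, the variable names, and the sample strings are illustrative assumptions, not taken from the course materials.

```python
import re

# Compile the pattern once; the resulting pattern object can be reused.
# Pattern and sample strings are illustrative, not from the course materials.
letters_then_digits = re.compile(r'[a-z]+\d+')

samples = ['abc123', 'x9 and more', '123abc']
results = []
for s in samples:
    m = letters_then_digits.match(s)   # match() only looks at the start of s
    results.append(m.group() if m else None)

print(results)
```

The last sample produces None because it does not begin with a lowercase letter, so match finds nothing at the start of the string.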

Regular expressions have many special characters, and all of them are worth reviewing. At the introductory level, however, it is most important to master a few that are used constantly: ., *, +, [ ], and \. The . stands for any single character (other than a newline, by default), the * stands for zero or more repetitions of the preceding element, the + stands for one or more repetitions of the preceding element, the [ ] stands for a class of characters, and the \ acts as an escape character, indicating that the re special character following it is to be treated as an actual character to be searched for. As you master these, you can add others to your arsenal to form more sophisticated searches.
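The short sketch below tries each of these special characters on a small string. All patterns and test strings are made-up examples chosen to show one behavior each.

```python
import re

# Each entry: (pattern, text, description). All values are illustrative.
examples = [
    (r'c.t',     'cat',   '. matches any single character'),
    (r'ab*c',    'ac',    '* allows zero repetitions of b'),
    (r'ab+c',    'ac',    '+ requires at least one b, so no match'),
    (r'ab+c',    'abbbc', '+ matches one or more b'),
    (r'[aeiou]', 'xyz e', '[ ] matches any one character in the class'),
    (r'3\.14',   '3.14',  'backslash makes the period literal'),
    (r'3\.14',   '3x14',  'a literal period does not match x'),
]

matched = []
for pattern, text, description in examples:
    found = re.search(pattern, text) is not None
    matched.append(found)
    print(f'{pattern!r:12} {text!r:10} {found!s:6} {description}')
```

Comparing the second and third rows shows the difference between * and + on the same string.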

To review, see Syntax and Usage.

 

8b. Use regular expressions to construct search patterns to match a string or set of strings

  • What does the search method do?
  • What does the match method do?
  • Write some simple regular expressions that can be used in a search.

It is important to know the difference between re.search() and re.match(). re.match() only looks for a pattern match at the beginning of a string. re.search() looks for the first match within a string (which could be at the beginning of a string). These are two of the most basic methods for searching with regular expressions.
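A small sketch of the difference, using a made-up sample string: match succeeds only when the pattern occurs at index 0, while search scans the whole string. Both return a match object on success and None otherwise.

```python
import re

text = 'unit 8 covers regular expressions'   # illustrative string

at_start = re.match(r'unit', text)          # pattern is at index 0: match object
not_at_start = re.match(r'regular', text)   # pattern is not at index 0: None
anywhere = re.search(r'regular', text)      # search scans the whole string

print(at_start is not None)
print(not_at_start)
print(anywhere.start())
```

Because both methods return None when nothing is found, it is common to test the result before calling methods such as start() on it.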

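Learning regular expressions is like learning a new language: it takes a great deal of practice and reinforcement. Consider the regular expression below, written as a Python raw string (the r prefix) so that the backslashes are passed to the re module unchanged:

r'.*\^\d\s[b-p]*'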

and consider the set of possible strings that could match this pattern. The .* means zero or more repetitions of any character. The \^ means look for a literal ^ (the backslash tells re not to treat ^ as a special character). The \d means look for a decimal digit and \s means a whitespace character. Finally, [b-p]* means zero or more repetitions of any lowercase letter from b to p in the English alphabet. Convince yourself that any of the following strings

s='asdfgh^9 '
s='asdfgh^9 bcdmno'
s='^9 bcdmno'

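would generate a match for this regular expression:

import re
pattern = r'.*\^\d\s[b-p]*'    # raw string keeps the backslashes intact
print(re.search(pattern, s))   # prints a match object, or None if no match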
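To check all three strings in one run, a sketch such as the following can be used. The list name strings and the counterexample non_matching are assumptions added for illustration; the counterexample simply drops the ^ character.

```python
import re

pattern = r'.*\^\d\s[b-p]*'
strings = ['asdfgh^9 ', 'asdfgh^9 bcdmno', '^9 bcdmno']   # from the text above
non_matching = 'asdfgh9 bcdmno'   # hypothetical counterexample: no ^ character

matches = [re.search(pattern, s) for s in strings]
for s, m in zip(strings, matches):
    # m.group() is the portion of s that the pattern matched
    print(repr(s), '->', repr(m.group()))

print(re.search(pattern, non_matching))   # no ^ in the string, so None
```

In each of the three cases the pattern matches the entire string, which is a useful way to confirm your reading of each piece of the expression.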

To review, see Delving Deeper.

 

8c. Solve common tasks by using regular expressions to match patterns

  • How do we iterate regular expression pattern searches through a string?
  • How does the finditer method work?
  • What useful information is available from the finditer method?

A recurring theme in Python programming is extension: once a problem can be solved for a single pass, how can the solution be extended to multiple passes? For example, once you could use a for loop to iterate over a list, iterating over a dictionary was a natural extension of the loop statement. A similar progression follows for re searches; however, an re iterator must be made available. The finditer method provides one, so the expected Python for loop structure remains intact.

The finditer method scans a string from left to right and yields a match object for each non-overlapping match, in the order the matches are found. Useful information such as the start and end indices can be obtained via the appropriate method calls. For example,

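import re

pattern = r'\d+'          # example pattern (illustrative): one or more digits
text = 'a1 b22 c333'      # example text (illustrative)
for match in re.finditer(pattern, text):
    s = match.start()     # index where this match begins
    e = match.end()       # index just past the end of this match
    print(text[s:e])      # the matched substring, same as match.group()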

can search for multiple non-overlapping occurrences of a pattern within a string. In addition, the start and end methods let you record the location of each match within the string. Most importantly, the finditer iterator preserves our expectation of how the for statement should work in Python.
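As a further sketch, the loop below records each match together with its location, and a second call illustrates what "non-overlapping" means in practice. The text, patterns, and variable names are illustrative assumptions.

```python
import re

text = 'cat bat rat'     # illustrative text
pattern = r'[a-z]at'     # illustrative pattern: a lowercase letter followed by "at"

locations = []
for match in re.finditer(pattern, text):
    # Record the matched text with its start and end indices.
    locations.append((match.group(), match.start(), match.end()))

print(locations)

# Non-overlapping: after a match ends, the scan resumes at that end index,
# so 'aa' is found twice in 'aaaa', not three times.
overlap_demo = [m.span() for m in re.finditer(r'aa', 'aaaa')]
print(overlap_demo)
```

The span method used in the second part returns the start and end indices together as a tuple.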

To review, see Delving Deeper.

 

Unit 8 Vocabulary

Be sure you understand these terms as you study for the final exam. Try to think of the reason why each term is included.

  • regular expression
  • re module
  • compile method
  • special characters
  • finditer method