Unit 8: Regular Expressions
8a. Explain why and how regular expressions are used
- What is a regular expression and what is the re module?
- What is the compile method used for?
- Name some special characters useful for forming regular expressions.
A regular expression is a string of characters that describes a pattern to look for in other strings. Once constructed, a regular expression can be used to search other strings for that pattern. The re module contains a number of methods useful for applying regular expressions in Python. As part of this module, the compile method is useful because it converts a regular expression pattern into a pattern object that can be reused efficiently for searching. Other important methods include match and search.
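As a minimal sketch of the idea above, the following compares a direct call to re.search with a call through a compiled pattern object. The sample text and the variable names (text, number, m1, m2) are made up for illustration.

```python
import re

text = "error 404: page not found"  # hypothetical sample string

# Module-level call: the pattern string is passed in on every call.
m1 = re.search(r"[0-9]+", text)

# Compiled form: build the pattern object once, then reuse it.
number = re.compile(r"[0-9]+")
m2 = number.search(text)

# Both approaches find the same first run of digits.
print(m1.group(), m2.group())  # 404 404
```

Compiling is mainly a convenience when the same pattern is applied many times, for example inside a loop.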
There are many special characters for regular expressions that should be reviewed. At the introductory level, however, it is important to master a few of them: the period (.), the asterisk (*), the plus sign (+), square brackets ([ ]), and the backslash (\). The . stands for an arbitrary character (other than a newline, by default), the * stands for zero or more repetitions of the preceding element, the + stands for one or more repetitions of the preceding element, [ ] defines a class of characters, and the \ acts as an escape character to indicate that an re special character is to be treated as an actual character to be searched for. As you master these, you can add others to your arsenal to form more sophisticated searches.
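The short sketch below tries each of these special characters on a tiny example. The sample strings and variable names are chosen only for illustration.

```python
import re

# "." matches any single character, so "c.t" matches "cut".
dot = re.search(r"c.t", "cut") is not None

# "*" allows zero repetitions of "b", so "ab*c" matches "ac".
star = re.search(r"ab*c", "ac") is not None

# "+" requires at least one "b", so "ab+c" does not match "ac".
plus = re.search(r"ab+c", "ac") is not None

# "[ae]" matches one character from the class, so "gr[ae]y" matches "grey".
char_class = re.search(r"gr[ae]y", "grey") is not None

# "\." matches only a literal period, so it rejects "3x14" ...
escaped = re.search(r"3\.14", "3x14") is not None
# ... whereas an unescaped "." accepts any character in that position.
unescaped = re.search(r"3.14", "3x14") is not None

print(dot, star, plus, char_class, escaped, unescaped)
# True True False True False True
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslash through to the re module unchanged.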
To review, see Syntax and Usage.
8b. Use regular expressions to construct search patterns to match a string or set of strings
- What does the search method do?
- What does the match method do?
- Write some simple regular expressions that can be used in a search.
It is important to know the difference between re.search() and re.match(). re.match() only looks for a pattern match at the beginning of a string. re.search() looks for the first match anywhere within a string (which could be at the beginning of the string). These are two of the most basic methods for searching with regular expressions.
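To make the difference concrete, here is a small sketch using a made-up sample string. The names s, at_start, anywhere, and prefix are illustrative only.

```python
import re

s = "the cat sat"  # hypothetical sample string

# match() anchors at the beginning, and "cat" is not there, so the result is None.
at_start = re.match(r"cat", s)
print(at_start)          # None

# search() scans forward and finds "cat" starting at index 4.
anywhere = re.search(r"cat", s)
print(anywhere.span())   # (4, 7)

# match() succeeds when the pattern does begin the string.
prefix = re.match(r"the", s)
print(prefix.group())    # the
```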
Learning regular expressions is like learning a new language, and much practice and reinforcement is required. Consider the regular expression
'.*\^\d\s[b-p]*'
and consider the set of possible strings that could match this pattern. The .* means zero or more repetitions of any character. The \^ means look for a literal ^ (the backslash tells re not to treat ^ as a special character). The \d means look for a decimal digit, and \s means look for a whitespace character. Finally, [b-p]* means zero or more repetitions of any lowercase character from b to p in the English alphabet. Convince yourself that any of the following strings
s = 'asdfgh^9 '
s = 'asdfgh^9 bcdmno'
s = '^9 bcdmno'
would generate a match for this regular expression
import re

# Raw string, so the backslashes reach the re module unchanged
pattern = r'.*\^\d\s[b-p]*'
print(re.search(pattern, s))
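As a way to check this, the sketch below runs the same pattern over all three strings in one loop. The last two candidate strings are not from the text above; they are added here as assumptions to show a partial match and a failed match.

```python
import re

pattern = r'.*\^\d\s[b-p]*'

candidates = [
    'asdfgh^9 ',
    'asdfgh^9 bcdmno',
    '^9 bcdmno',
    'x^7 bczz',      # hypothetical: "z" is outside b-p, so the match stops early
    'asdfgh^9bcd',   # hypothetical: no whitespace after the digit, so no match
]

results = {}
for s in candidates:
    m = re.search(pattern, s)
    # Record the matched text, or None when the pattern is not found.
    results[s] = m.group() if m else None
    print(repr(s), '->', repr(results[s]))
```

The first three strings are matched in full, the fourth is matched only up to 'x^7 bc', and the fifth produces None.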
To review, see Delving Deeper.
8c. Solve common tasks by using regular expressions to match patterns
- How do we iterate regular expression pattern searches through a string?
- How does the finditer method work?
- What useful information is available from the finditer method?
A major theme in Python programming is this: once a problem can be solved for a single pass, how can it be extended to multiple passes? For example, once a for loop could be used to iterate over a list, you saw a natural extension of the loop statement in order to iterate over a dictionary. A similar progression follows for re searches; however, an re iterator must be made available. The finditer method solves this problem so that the expected Python for loop structure remains intact.
The finditer method scans a string from left to right and yields non-overlapping matches in the order that they are found. Useful information such as the start and stop indices can be determined via the appropriate method calls. For example,
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print(text[s:e])
can search for multiple, non-overlapping occurrences of a pattern within a string. In addition, the start and end methods give you the opportunity to record the locations of each match within the string. Most importantly, the finditer iterator preserves our expectation of how the for statement should work in Python.
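A complete, runnable version of that loop might look like the sketch below. The sample text and pattern are assumptions chosen for illustration, and the spans list is added only to collect the results.

```python
import re

text = "cat, cot, and cut"   # hypothetical sample string
pattern = r"c[aou]t"         # "c", then one of a/o/u, then "t"

spans = []
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    spans.append((s, e, text[s:e]))
    print(s, e, text[s:e])
# 0 3 cat
# 5 8 cot
# 14 17 cut

# Matches never overlap: "aa" is found twice in "aaaa", not three times.
pairs = [m.span() for m in re.finditer(r"aa", "aaaa")]
print(pairs)  # [(0, 2), (2, 4)]
```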
To review, see Delving Deeper.
Unit 8 Vocabulary
Be sure you understand these terms as you study for the final exam. Try to think of the reason why each term is included.
- regular expression
- re module
- compile method
- special characters
- finditer method