Syntax and Usage
Special Characters and Their Purpose
Now let's have a look at the different special characters and some examples using them in combination with the name list code from above. Here is a brief overview of the characters and their purpose:
Character | Purpose |
---|---|
. | stands for a single arbitrary character |
[ ] | are used to define classes of characters and match any character of that class |
( ) | are used to define groups consisting of multiple characters in a sequence |
+ | stands for arbitrarily many repetitions of the previous character or group but at least one occurrence |
* | stands for arbitrarily many repetitions of the previous character or group including no occurrence |
? | stands for zero or one occurrence of the previous character or group, so basically says that the character or group is optional |
{m,n} | stands for at least m and at most n repetitions of the previous character or group, where m and n are integers |
^ | stands for the beginning of the string |
$ | stands for the end of the string |
\| | stands between two characters or groups and matches either the left or the right character/group, so it is used to define alternatives |
\ | is used in combination with the next character to define special classes of characters or to use a special character in its verbatim sense |
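The quantifiers, groups, anchors, and alternatives from the table are easiest to understand by trying them on short strings. The following is a minimal sketch using Python's re module. The sample strings and the helper name `matches` are made up for illustration and are not part of the name list example.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# + : one or more repetitions of the previous character
print(matches(r"ab+", "abbb"))       # True
print(matches(r"ab+", "a"))          # False, at least one 'b' is required

# * : zero or more repetitions of the previous character
print(matches(r"ab*", "a"))          # True

# ? : the previous character is optional
print(matches(r"colou?r", "color"))  # True

# {m,n} : at least m and at most n repetitions
print(matches(r"a{2,3}", "aaaa"))    # False, four repetitions are too many

# ( ) and | : groups and alternatives
print(matches(r"(ab)+", "ababab"))   # True
print(matches(r"cat|dog", "dog"))    # True

# ^ and $ : beginning and end of the string (shown with re.search)
print(re.search(r"^John", "Papa John") is not None)  # False
print(re.search(r"John$", "Papa John") is not None)  # True
```

re.fullmatch is used in the helper so that each result depends on the entire string, which makes the effect of a single special character easier to see.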
Since the dot stands for any character, the regular expression ".u" can be used to get all names that have the letter 'u' as the second character. Give this a try by using ".u" for the regular expression in line 1 of the code from the previous example.
pattern = ".u"
The output will be:
Susanne Walker
Similarly, we can use "..cha" to get all names that start with two arbitrary characters followed by the character sequence "cha", resulting in "Michael Mason" and "Richard Robertson" being the only matches. By the way, it is strongly recommended that you experiment a bit in this section by modifying the patterns used in the examples. If you don't understand the results you are getting in some case, feel free to post this as a question on the course forums.
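To experiment with these patterns without going back to the previous example, a small self-contained sketch like the following can be used. It assumes that the earlier code tests each name with re.match, which only matches at the beginning of the string. The list `names` contains just a few of the names mentioned in this section, and the helper `matching_names` is a hypothetical name introduced here.

```python
import re

# Assumed subset of the name list from the previous example
names = ["Michael Mason", "Jennifer Johnson", "John Williams",
         "Susanne Walker", "Richard Robertson"]

def matching_names(pattern, candidates):
    """Return the candidates that match the pattern starting at their first character."""
    compiled = re.compile(pattern)
    return [name for name in candidates if compiled.match(name)]

print(matching_names(".u", names))     # ['Susanne Walker']
print(matching_names("..cha", names))  # ['Michael Mason', 'Richard Robertson']
```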
Maybe you are wondering how one would use the different special characters in the verbatim sense, e.g. to find all names that contain a dot. This is done by putting a backslash in front of them, so \. for the dot, \? for the question mark, and so on. If you want to match a single backslash in a regular expression, this needs to be represented by a double backslash in the regular expression. However, one has to be careful here when writing this regular expression as a string literal in the Python code: Because of the string escaping mechanism, a sequence of two backslashes will only produce a single backslash in the string character sequence. Therefore, you actually have to use four backslashes, "xyz\\\\xyz" to produce the correct regular expression involving a single backslash. Or you use a raw string in which escaping is disabled, so r"xyz\\xyz". Here is one example that uses \. to search for names with a dot as the third character returning "Dr. Melissa Franklin" and "Dr. Dr. Matthew Malone" as the only results:
pattern = r"..\."
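The difference between the four-backslash literal and the raw string can be checked directly in Python. The sketch below is only an illustration: the string "xyz\xyz" follows the example in the text, while the short `names` list is an assumed subset of the name list. Note that recent Python versions emit a warning for unrecognized escape sequences such as \. in ordinary string literals, which is another reason to prefer raw strings for patterns.

```python
import re

# Both literals produce the same regular expression text: xyz\\xyz
escaped = "xyz\\\\xyz"
raw = r"xyz\\xyz"
print(escaped == raw)                   # True

# The text to be searched contains one actual backslash
text = "xyz\\xyz"
print(len(text))                        # 7
print(re.match(raw, text) is not None)  # True

# Assumed subset of the name list; \. matches a verbatim dot
names = ["Dr. Melissa Franklin", "Susanne Walker", "Dr. Dr. Matthew Malone"]
print([n for n in names if re.match(r"..\.", n)])
# ['Dr. Melissa Franklin', 'Dr. Dr. Matthew Malone']
```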
Next, let us combine the dot (.) with the star (*) symbol, which stands for zero or more repetitions of the previous character. The pattern ".*John" can be used to find all names that contain the character sequence "John". The .* at the beginning can match any sequence of characters of arbitrary length from the . class (so any available character). For instance, for the name "Jennifer Johnson", the .* matches the sequence "Jennifer " made up of nine characters from the . class, and since this is followed by the character sequence "John", the entire name matches the regular expression.
pattern = ".*John"
The output will be:
Jennifer Johnson
John Williams
Papa John
Walter John Miller
Please note that the name "John Williams" is a valid match because the * also includes zero occurrences of the preceding character, so ".*John" will also match "John" at the beginning of a string.
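To see which part of each name is consumed by the .* prefix, the match object can be inspected. This is a minimal sketch under the assumption that names are tested with re.match. The list `names` is an assumed subset of the name list, and the variable names are chosen for illustration.

```python
import re

# Assumed subset of the name list from the previous example
names = ["Jennifer Johnson", "John Williams", "Susanne Walker",
         "Papa John", "Walter John Miller"]

compiled = re.compile(".*John")
for name in names:
    result = compiled.match(name)
    if result:
        # group() is the matched text; cutting off "John" leaves what .* matched
        prefix = result.group()[:-len("John")]
        print(name, "->", repr(prefix))
```

For "John Williams" the printed prefix is the empty string, which is the zero-occurrences case described above, while "Susanne Walker" is not printed at all.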
The dot used in the previous examples is a special character for representing an entire class of characters, namely any character. It is also possible to define your own class of characters within a regular expression with the help of square brackets. For instance, [abco] stands for the class consisting of only the characters 'a', 'b', 'c', and 'o'. When it is used in a regular expression, it matches any one of these four characters. So the pattern ".[abco]" can, for instance, be used to get all names that have either 'a', 'b', 'c', or 'o' as the second character. This means using ...
pattern = ".[abco]"
...we get the output:
John Williams
Papa John
Walter John Miller
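A character class behaves like a compact way of writing alternatives between single characters. The sketch below compares ".[abco]" with the equivalent pattern built from a group and the | symbol. It assumes matching with re.match and uses an assumed subset of the name list.

```python
import re

# Assumed subset of the name list from the previous example
names = ["Jennifer Johnson", "John Williams", "Susanne Walker",
         "Papa John", "Walter John Miller"]

with_class = re.compile(".[abco]")
with_alternatives = re.compile(".(a|b|c|o)")

class_hits = [n for n in names if with_class.match(n)]
alternative_hits = [n for n in names if with_alternatives.match(n)]

print(class_hits)                      # ['John Williams', 'Papa John', 'Walter John Miller']
print(class_hits == alternative_hits)  # True
```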
When defining classes, we can make use of ranges of characters denoted by a hyphen. For instance, the range m-o stands for the lower-case characters 'm', 'n', and 'o'. The class [m-oM-O.] would then consist of the characters 'm', 'n', 'o', 'M', 'N', 'O', and '.'. Please note that when a special character such as the dot in this example appears within the square brackets of a class definition, it is used in its verbatim sense (the hyphen, the backslash, the closing bracket, and a ^ directly after the opening bracket keep a special meaning inside a class). Try this idea of using ranges out with the following example:
pattern = "......[m-oM-O.]"
The output will be...
Papa John
Dr. Dr. Matthew Malone
... because these are the only names that have a character from the class [m-oM-O.] as the seventh character.
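One way to check a result like this is to print the seventh character of each name next to the outcome of the match. The following sketch again assumes matching with re.match and uses an assumed subset of the name list.

```python
import re

# Assumed subset of the name list from the previous example
names = ["Jennifer Johnson", "John Williams", "Papa John",
         "Dr. Dr. Matthew Malone"]

pattern = re.compile("......[m-oM-O.]")
for name in names:
    # name[6] is the seventh character because indexing starts at 0
    print(repr(name[6]), bool(pattern.match(name)), name)

# Inside a class the dot is verbatim: it matches '.' but no other character
print(re.fullmatch("[.]", ".") is not None)  # True
print(re.fullmatch("[.]", "x") is not None)  # False
```

Only the names whose seventh character is 'o' or '.' are reported as matches here.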