Splitting with Patterns
str.split() is one of the most frequently used methods for breaking apart strings to parse them. It supports only the use of literal values as separators, though, and sometimes a regular expression is necessary if the input is not consistently formatted. For example, many plain text markup languages define paragraph separators as two or more newline (\n) characters. In this case, str.split() cannot be used because of the "or more" part of the definition.
A strategy for identifying paragraphs using findall() would use a pattern that captures a run of text followed by two or more newlines.
That pattern fails for paragraphs at the end of the input text, as illustrated by the fact that "Paragraph three." is not part of the output.
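The failure at the end of the input can be reproduced with a short sketch. The sample text and the exact pattern, (.+?)\n{2,}, are assumptions chosen to match the description here, not necessarily the ones used originally.

```python
import re

# Hypothetical sample input: three paragraphs separated by two or
# three newlines, with no trailing newline after the last one.
text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# Assumed pattern: lazily capture text that is followed by two or more
# newlines. re.DOTALL lets "." match the single newline inside the
# first paragraph.
found = re.findall(r'(.+?)\n{2,}', text, flags=re.DOTALL)

for num, para in enumerate(found):
    print(num, repr(para))
```

Only two paragraphs are reported, because nothing follows "Paragraph three." for the trailing \n{2,} to match.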
Extending the pattern to say that a paragraph ends with two or more newlines or the end of input fixes the problem, but makes the pattern more complicated. Switching from re.findall() to re.split() handles the boundary condition automatically and keeps the pattern simpler.
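The two approaches can be compared side by side in a minimal sketch. The sample text and the extended pattern, (.+?)(\n{2,}|$), are illustrative assumptions; other ways of expressing "or the end of input" would work as well.

```python
import re

# Hypothetical sample input: three paragraphs, no trailing newline.
text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# Assumed extended pattern: a paragraph ends with two or more newlines
# OR the end of the input. With two groups, findall() returns
# (paragraph, terminator) tuples.
with_findall = re.findall(r'(.+?)(\n{2,}|$)', text, flags=re.DOTALL)

# With split(), the pattern describes only the separator.
with_split = re.split(r'\n{2,}', text)

print('With findall:')
for num, (para, _terminator) in enumerate(with_findall):
    print(num, repr(para))

print('With split:')
for num, para in enumerate(with_split):
    print(num, repr(para))
```

Both versions report all three paragraphs, but the split() pattern needs neither the alternation nor the DOTALL flag.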
The pattern argument to split() expresses the markup specification more precisely. Two or more newline characters mark a separator point between paragraphs in the input string.
Enclosing the expression in parentheses to define a group causes split() to work more like str.partition(), so it returns the separator values as well as the other parts of the string.
The output now includes each paragraph, as well as the sequence of newlines separating them.
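A minimal sketch of the grouped form, again using a hypothetical sample text:

```python
import re

# Hypothetical sample input: three paragraphs, no trailing newline.
text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# The parentheses turn the separator into a capturing group, so
# re.split() keeps each run of newlines in the result list.
parts = re.split(r'(\n{2,})', text)

for num, part in enumerate(parts):
    print(num, repr(part))
```

With this input, the result alternates between paragraphs at even indexes and separators at odd indexes.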