Two main ways to use regex
– Create a pattern object and call the regex functions on it:
    import re
    from typing import Match, Optional, Pattern

    pattern: Pattern[str] = re.compile('animal')
    result: Optional[Match[str]]
    result = pattern.match('I am an animal')
    result = pattern.match('I am a boy')
    result = pattern.search('I am a girl')
    # and so forth...
We use compile() when we want to reuse a pattern: it is compiled once and the same object serves every call.
– Call the module-level regex functions, passing at least 2 parameters: the pattern (a string or a compiled pattern) and the text to which we want to apply the pattern:
    result = re.match('animal', 'I am an animal')
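As a minimal sketch of the two approaches side by side (the sample strings are illustrative), the snippet below checks that a compiled pattern and the module-level function find the same spans:

```python
import re

texts = ('I am an animal', 'I am a boy')

# Way 1: compile once, then reuse the pattern object.
pattern = re.compile('animal')
compiled_results = [pattern.search(text) for text in texts]

# Way 2: call the module-level function with the pattern string each time.
direct_results = [re.search('animal', text) for text in texts]

# Both ways find the same spans (None when there is no match).
compiled_spans = [m.span() if m else None for m in compiled_results]
direct_spans = [m.span() if m else None for m in direct_results]
print(compiled_spans)  # [(8, 14), None]
print(direct_spans)    # [(8, 14), None]
```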
Match instance functions
Note: all the function descriptions and examples below rely on the same pattern and match:
    pattern: Pattern[str] = re.compile(r'(.*)-(\d*)')
    result: Match[str] = pattern.match('David-123')
groups()
: Return a tuple containing all the subgroups of the match.
Example:
    result: Match[str] = pattern.match('David-123')
    print(result.groups())  # ('David', '123')
group(group_number, ...)
: Returns one or more subgroups of the match.
If there is a single argument, the result is a single string;
else the result is a tuple with one item per argument.
Example:
    print(result.group(0))  # David-123
    print(result.group(1))  # David
    print(result.group(2))  # 123
    print(result.group(1, 2))  # ('David', '123')
__getitem__()
: identical to group(group_number), but written with index syntax (result[n]). It returns a single string, not a list.
Example:
    print(result[0])  # David-123
    print(result[1])  # David
    print(result[2])  # 123
Match.string
: The string passed to match() or search().
Example:
    print(f'result.string={result.string}')  # result.string=David-123
start([group_number]) and end([group_number])
: Return the indices of the start and end of the substring matched by the group; the group defaults to zero (meaning the whole matched substring).
Example:
    matched_value = result.string[result.start(1):result.end(1)]
    print(f'matched_value={matched_value}')  # matched_value=David
    entire_matched_value = result.string[result.start():result.end()]
    print(f'entire_matched_value={entire_matched_value}')  # entire_matched_value=David-123
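To make the group numbering concrete, here is a small sketch that lists the (start, end) pair of every group for the same pattern and input. It also uses span(), a Match method that returns both indices at once:

```python
import re

pattern = re.compile(r'(.*)-(\d*)')
result = pattern.match('David-123')
assert result is not None

# Collect (start, end) for the whole match (group 0) and each subgroup.
positions = {n: (result.start(n), result.end(n)) for n in range(3)}
print(positions)  # {0: (0, 9), 1: (0, 5), 2: (6, 9)}

# span(n) is a shorthand that returns the same (start, end) pair.
print(result.span(2))  # (6, 9)
```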
Functions returning Matches
re.match(pattern, string, flags=0)
Looks for a match only at the beginning of the string.
Example:
    result: Optional[Match[str]]
    # match() checks for a match only at the beginning of the string
    result = re.match('animal', 'I am an animal')
    self.assertIsNone(result)
    result = re.match('animal', 'animal and so for')
    self.assertEqual((0, 6), (result.start(), result.end()))
re.search(pattern, string, flags=0)
Scans the whole string and returns the first location where the regex matches.
Example:
    result = re.search('animal', 'I am an animal')
    self.assertEqual((8, 14), (result.start(), result.end()))
    print(f'result={result}')
    # with a leading '^', search() is restricted to the beginning of the string
    result = re.search('^animal', 'I am an animal')
    self.assertIsNone(result)
Pattern.fullmatch(string[, pos[, endpos]])
Succeeds only if the whole string matches the pattern.
Example:
    pattern_snake_text: Pattern = re.compile(r'([à-ÿa-z0-9]+_)+[à-ÿa-z0-9]+')
    text = 'dump_traceback_later dummy'
    match: Optional[Match] = pattern_snake_text.match(text)
    # match=re.Match object; span=(0, 20), match='dump_traceback_later'
    match = pattern_snake_text.fullmatch(text)
    # match=None
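A typical use of fullmatch() is input validation. The sketch below is an illustration only: the helper name is_snake_case and the simplified ASCII-only pattern are assumptions, not part of the example above.

```python
import re

# Hypothetical helper: the name and the simplified pattern are illustrative only.
SNAKE_CASE = re.compile(r'[a-z0-9]+(_[a-z0-9]+)*')


def is_snake_case(name: str) -> bool:
    """Return True only if the whole string is a snake_case identifier."""
    return SNAKE_CASE.fullmatch(name) is not None


print(is_snake_case('dump_traceback_later'))        # True
print(is_snake_case('dump_traceback_later dummy'))  # False
```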
Match versus search with multiline
In MULTILINE mode, match() only matches at the beginning of the string, whereas search() with a regular expression beginning with '^' will match at the beginning of each line.
Example:
    result: Optional[Match[str]]
    result = re.match('X', 'A\nB\nX', re.MULTILINE)
    self.assertIsNone(result)
    # with search
    result = re.search('^X', 'A\nB\nX', re.MULTILINE)
    self.assertEqual((4, 5), (result.start(), result.end()))
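As a small sketch of what the MULTILINE flag changes, the snippet below collects every position where '^' followed by a word character matches, with and without the flag (same three-line input as above):

```python
import re

text = 'A\nB\nX'

# Without MULTILINE, '^' only matches at the very start of the string.
starts_default = [m.start() for m in re.finditer(r'^\w', text)]
# With MULTILINE, '^' also matches right after each newline.
starts_multiline = [m.start() for m in re.finditer(r'^\w', text, re.MULTILINE)]

print(starts_default)    # [0]
print(starts_multiline)  # [0, 2, 4]
```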
re.finditer(pattern, string, flags=0)
Returns an iterator over all non-overlapping matches.
Example:
    line = 'i_am_a_very_good_person'
    # underscores are at indices 1, 4, 6, 11 and 16
    matches: list[Match[str]] = list(re.finditer('_', line))
    print(f'matches={matches}')
    self.assertEqual((1, 2), (matches[0].start(), matches[0].end()))
    self.assertEqual((4, 5), (matches[1].start(), matches[1].end()))
    # and so forth...
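Because finditer() returns an iterator, matches can be consumed one at a time without building the full list first. A minimal sketch on the same input:

```python
import re

line = 'i_am_a_very_good_person'

# Each Match is produced lazily as the comprehension iterates.
underscore_positions = [match.start() for match in re.finditer('_', line)]
print(underscore_positions)  # [1, 4, 6, 11, 16]
```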
Functions returning the matching strings
re.findall(pattern, string, flags=0)
: Returns a list of all non-overlapping matches in the string.
– If there is no group: returns a list of strings matching the whole pattern.
– If there is exactly one group: returns a list of strings matching that group.
– If multiple groups are present: returns a list of tuples of strings matching the groups.
Beware: empty matches are included in the result.
Example with no group:
    # \b matches a word boundary (an empty position between a word character and a non-word character), not whitespace
    results: List[str] = re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest f')
    self.assertEqual(['foot', 'fell', 'fastest', 'f'], results)
Example with groups:
    # \w matches a word character and \d matches a digit
    results: List[Tuple[str, str]] = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
    self.assertEqual([('width', '20'), ('height', '10')], results)
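The two cases not illustrated above, exactly one group and empty matches, can be sketched as follows (the input strings are illustrative):

```python
import re

text = 'set width=20 and height=10'

# Exactly one group: only the text captured by that group is returned.
values = re.findall(r'\w+=(\d+)', text)
print(values)  # ['20', '10']

# Empty matches are included: '\d*' matches the empty string at each
# position of 'ab' (before 'a', before 'b', and at the end).
digits = re.findall(r'\d*', 'ab')
print(digits)  # ['', '', '']
```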
Replace functions
The re.sub() function performs replacements.
Definition:
    def sub(pattern: AnyStr | Pattern[AnyStr],
            repl: AnyStr | Callable[[Match[AnyStr]], AnyStr],
            string: AnyStr,
            count: int = 0,
            flags: int | RegexFlag = 0) -> AnyStr
Specification:
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
If the pattern isn't found, string is returned unchanged.
repl can be a string or a function.
– If repl is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone.
– If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
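The following sketch illustrates both kinds of repl, plus the count parameter. The input string and the helper name double are made up for the illustration:

```python
import re

text = 'width=20 height=10'

# String replacement: the escape '\n' in repl becomes a real newline character.
one_per_line = re.sub(' ', r'\n', text)
print(one_per_line == 'width=20\nheight=10')  # True


# Function replacement: called once per match, returns the replacement text.
def double(match: re.Match) -> str:
    return str(int(match.group(0)) * 2)


doubled = re.sub(r'\d+', double, text)
print(doubled)  # width=40 height=20

# count limits how many occurrences are replaced (leftmost first).
first_only = re.sub(r'\d+', double, text, count=1)
print(first_only)  # width=40 height=10
```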
Our use case of string replacement
Suppose we have a string containing a code statement that references a path. If the line ends with a todo comment, we want to replace this path with a default path and remove the todo comment.
We will illustrate this case in both ways: string replacement and function replacement.
In both cases, the search regex is the same. The difference lies in how the replacement string or the replacement function performs the actual replacement.
Here is the common part of the statement:
    new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
The general idea of this pattern:
– We define a group for each part of the regex that we want to handle specifically in the replacement expression.
– Here, we don't want to modify anything other than the path of the image (group 2), so what comes before it and what comes after it are each captured in their own group (groups 1 and 3).
– Because we want to remove the todo comment, we capture it in a separate group (group 4).
1) With a string as replacement
    text: str = 'open("tests/foo_image.png", 0, 0.8) # TODO '
    new_path: str = 'tests/default.png'
    # groups: 1 = text before the path, 2 = the path, 3 = text after the path, 4 = the todo comment
    new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
                      rf'\1{new_path}\3', text)
    print(f'new_line={new_line}')
    # new_line=open("tests/default.png", 0, 0.8)
In the replacement expression, we can reference any of the groups, or leave some out, in whatever order we want: rf'\1{new_path}\3'.
In \n, n is the number of the group to output.
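One caveat with the \n form: if the text inserted right after it starts with a digit, the digits run together. The re module also accepts the \g<n> form, which avoids this. A minimal sketch, assuming a hypothetical replacement path that starts with a digit:

```python
import re

text = 'open("tests/foo_image.png", 0, 0.8) # TODO '
pattern = re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$')

# Hypothetical replacement path that starts with a digit.
new_path = '2024/default.png'

# With rf'\1{new_path}\3', the leading digits of the path would merge with
# '\1', so it would no longer mean "group 1". '\g<1>' is unambiguous.
new_line = pattern.sub(rf'\g<1>{new_path}\g<3>', text)
print(new_line)  # open("2024/default.png", 0, 0.8)
```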
2) With a function as replacement
    def replaced_function(match: Match):
        print(f'result.groups()={match.groups()}')
        replacement: str = ''
        if match.group(1):
            print(f'match.group(1)={match.group(1)}')
            replacement += match.group(1) + '{new_name}'
        if match.group(3):
            print(f'match.group(3)={match.group(3)}')
            replacement += match.group(3).replace('\\', '/')
        print(f'replacement={replacement}')
        return replacement

    text: str = 'open("tests/foo_image.png", 0, 0.8) # TODO '
    new_path: str = 'tests/default.png'
    new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
                      replaced_function, text)
    new_line = new_line.format(new_name=new_path)
    print(f'new_line={new_line}')
Output:
    result.groups()=('open("', 'tests/foo_image.png', '", 0, 0.8) ', '# TODO ')
    match.group(1)=open("
    match.group(3)=", 0, 0.8)
    replacement=open("{new_name}", 0, 0.8)
    new_line=open("tests/default.png", 0, 0.8)
Unlike with string replacement, this time we cannot concatenate the new path directly inside the replacement function, because re.sub() calls it with a single argument (the match) and passes no additional parameters. As a workaround, the string returned by our function contains a format placeholder:
    if match.group(1):
        print(f'match.group(1)={match.group(1)}')
        replacement += match.group(1) + '{new_name}'
This way, we can fill in the placeholder once re.sub() has returned:
    new_line = new_line.format(new_name=new_path)
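An alternative sketch, not the approach used above: functools.partial can bind the extra value ahead of time, so re.sub() still receives a callable that takes only the match. This also avoids a second str.format() pass, which would misbehave if the line itself contained braces. The function name replace_path is hypothetical.

```python
import re
from functools import partial

PATTERN = re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$')


def replace_path(match: re.Match, new_path: str) -> str:
    """Rebuild the line around new_path and drop the todo comment (group 4)."""
    return match.group(1) + new_path + match.group(3)


text = 'open("tests/foo_image.png", 0, 0.8) # TODO '
# partial() binds new_path up front; re.sub() only supplies the match.
new_line = PATTERN.sub(partial(replace_path, new_path='tests/default.png'), text)
print(new_line)  # open("tests/default.png", 0, 0.8)
```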
Other functions
re.escape()
: Escapes special characters in a pattern. Useful to match an arbitrary literal string that may contain regular expression metacharacters.
For example:
    # raw strings keep the backslash as a literal character
    regex = re.escape(r'\ a.*$')
    print(f'regex={regex}')  # regex=\\\ a\.\*\$
    result = re.search(regex, r'\ a.*$')
    self.assertEqual((0, 6), (result.start(), result.end()))
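In practice, re.escape() is often applied to one piece of a larger pattern. The sketch below assumes a hypothetical user-supplied literal and embeds it in a pattern that captures a number:

```python
import re

# Hypothetical user-supplied literal containing regex metacharacters.
literal = 'price (USD)*'
sample = 'total price (USD)*=42'

# Escape the literal, then embed it in a larger pattern.
pattern = re.compile(re.escape(literal) + r'=(\d+)')
match = pattern.search(sample)
print(match.group(1))  # 42

# Without escaping, '(USD)*' is read as a repeated group, so nothing matches.
unescaped = re.search(literal + r'=(\d+)', sample)
print(unescaped)  # None
```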