Regex in Python

Two main ways to use regex

– Create a pattern object and call regex methods on it:

pattern: Pattern[str] = re.compile('animal')
result: Optional[Match[str]]
result = pattern.match('I am an animal')
result = pattern.match('I am a boy')
result = pattern.search('I am a girl')
# and so forth...

We use this approach (the compile() function) when we want to reuse a pattern: it is compiled only once and the same Pattern object is then applied to as many strings as needed.
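As a minimal sketch of that reuse, the snippet below compiles the pattern once and applies it to several strings. The list of sentences is hypothetical and only serves the illustration.

```python
import re

# Compile once, then reuse the same pattern object for every string.
pattern = re.compile('animal')

# Hypothetical inputs, for illustration only.
sentences = ['I am an animal', 'I am a boy', 'animal farm']

# search() looks anywhere in the string; match() only at the start.
found_anywhere = [s for s in sentences if pattern.search(s)]
found_at_start = [s for s in sentences if pattern.match(s)]

print(found_anywhere)
print(found_at_start)
```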

– Call the module-level regex functions, passing at least 2 parameters: the pattern (a string or a compiled pattern) and the text to which we want to apply the pattern:

result = re.match('animal', 'I am an animal')

Match instance functions

Note: all the function descriptions and examples below rely on the same pattern and match:

pattern: Pattern[str] = re.compile(r'(.*)-(\d*)')
result: Match[str] = pattern.match('David-123')

groups(): Return a tuple containing all the subgroups of the match.
Example:

result: Match[str] = pattern.match('David-123')
print(result.groups())  # ('David', '123')

group(group_number, ...): Returns one or more subgroups of the match. If there is a single argument, the result is a single string; else the result is a tuple with one item per argument.
Example:

print(result.group(0))  # David-123
print(result.group(1))  # David
print(result.group(2))  # 123
print(result.group(1, 2))  # ('David', '123')

__getitem__(group_number): identical to group(group_number), but written with index syntax (result[1]). It returns a string, not a list.
Example:

print(result[0])  # David-123
print(result[1])  # David
print(result[2])  # 123
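A quick check of that equivalence, as a sketch reusing the pattern and input from the note above:

```python
import re

result = re.compile(r'(.*)-(\d*)').match('David-123')
assert result is not None

# Index syntax and group() return the same string for the same group.
by_index = result[1]
by_group = result.group(1)

print(type(by_index).__name__, by_index == by_group)
```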

Match.string: The string passed to match() or search().
Example:

print(f'result.string={result.string}')  # David-123

start([group_number]) and end([group_number]): Return the indices of the start and end of the substring matched by the group; the group defaults to zero (meaning the whole matched substring).
Example:

matched_value = result.string[result.start(1):result.end(1)]
print(f'matched_value={matched_value}')  # David
entire_matched_value = result.string[result.start():result.end()]
print(f'entire_matched_value={entire_matched_value}')  # David-123
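When both indices are needed at once, the Match object also offers span([group_number]), which returns the (start, end) pair in a single call. A short sketch on the same input:

```python
import re

result = re.match(r'(.*)-(\d*)', 'David-123')
assert result is not None

# span(2) is equivalent to (result.start(2), result.end(2)).
start, end = result.span(2)
digits = result.string[start:end]

print(digits)
```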

Functions returning Matches

re.match(pattern: Pattern[AnyStr], string: AnyStr,)
Searches for a match only at the beginning of the string.
Example:

result: Optional[Match[str]]
# match() checks for a match only at the beginning of the string
result = re.match('animal', 'I am an animal')
self.assertIsNone(result)
result = re.match('animal', 'animal and so for')
self.assertEqual((0, 6), (result.start(), result.end()))

re.search(pattern: Pattern[AnyStr], string: AnyStr,)
Searches for the first location where the regex matches. It looks for a match anywhere in the string.
Example:

result = re.search('animal', 'I am an animal')
self.assertEqual((8, 14), (result.start(), result.end()))
print(f'result={result}')
# search() can also be restricted to the beginning of the string with ^
result = re.search('^animal', 'I am an animal')
self.assertIsNone(result)

Pattern.fullmatch(string[, pos[, endpos]])
Must match the whole string.
Example:

pattern_snake_text: Pattern = re.compile(r'([à-ÿa-z0-9]+_)+[à-ÿa-z0-9]+')
text = 'dump_traceback_later dummy'
match: Match = pattern_snake_text.match(text)
# match=re.Match object; span=(0, 20), match='dump_traceback_later'
match: Match = pattern_snake_text.fullmatch(text)
# match=None
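The difference between match(), search() and fullmatch() can be summarized with a small sketch. The pattern and the helper name describe are hypothetical, chosen only for the comparison.

```python
import re

# Hypothetical pattern: one or more digits.
pattern = re.compile(r'\d+')


def describe(text):
    """Return whether match(), search() and fullmatch() succeed on text."""
    return (
        pattern.match(text) is not None,      # must match at the start
        pattern.search(text) is not None,     # may match anywhere
        pattern.fullmatch(text) is not None,  # must cover the whole string
    )


print(describe('123'))
print(describe('123abc'))
print(describe('abc123'))
```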

Match versus search with multiline
In MULTILINE mode, match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with ‘^’ will match at the beginning of each line.
Example:

result: Optional[Match[str]]
result = re.match('X', 'A\nB\nX', re.MULTILINE)
self.assertIsNone(result)
# with search
result = re.search('^X', 'A\nB\nX', re.MULTILINE)
self.assertEqual((4, 5), (result.start(), result.end()))

re.finditer(pattern, string,)
Returns an iterator over all matches.
Example:

line = 'i_am_a_very_good_person'
#        1  4  6   11   16
matches: list[Match[str]] = list(re.finditer('_', line))
print(f'matches={matches}')
self.assertEqual((1, 2), (matches[0].start(), matches[0].end()))
self.assertEqual((4, 5), (matches[1].start(), matches[1].end()))
# and so forth...
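Since finditer() returns an iterator, the matches can also be consumed one by one without building a list of Match objects first. A minimal sketch on the same line:

```python
import re

line = 'i_am_a_very_good_person'

# Collect only the start index of each underscore.
positions = [m.start() for m in re.finditer('_', line)]

print(positions)
```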

Functions returning the matching strings

re.findall(pattern, string): Return a list of all non-overlapping matches in the string.
– If there is no group: return a list of strings matching the whole pattern.
– If there is exactly one group: return a list of strings matching that group.
– If multiple groups are present: return a list of tuples of strings matching the groups.

Beware:
Empty matches are included in the result.
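The warning about empty matches can be illustrated with a pattern that is allowed to match nothing. This is a sketch with hypothetical inputs: \d* matches the empty string at every position where there is no digit, whereas \d+ requires at least one digit.

```python
import re

# \d* may match zero characters, so each position yields an empty match.
empty_included = re.findall(r'\d*', 'ab')

# \d+ needs at least one digit, so no empty strings appear.
non_empty = re.findall(r'\d+', 'a1b22')

print(empty_included)
print(non_empty)
```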

Example with no group:

# \b matches the empty string at a word boundary (between a word character and a non-word character)
results: List[str] = re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest f')
self.assertEqual(['foot', 'fell', 'fastest', 'f'], results)

Example with groups:

# \w represents a word character, while \b represents a word boundary between a word character and a non-word character.
results: List[tuple[str, str]] = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
self.assertEqual([('width', '20'), ('height', '10')], results)

Replace functions

The re.sub() function performs replacements.

Definition:

    def sub(pattern: Pattern[AnyStr],
        repl: (Match[AnyStr]) -> AnyStr,
        string: AnyStr,
        count: int = ...,
        flags: int | RegexFlag = ...) -> AnyStr

Specification:
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
If the pattern isn’t found, string is returned unchanged.
repl can be a string or a function:
– If repl is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone.
– If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
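The points above can be sketched on a small, hypothetical input: a string replacement limited by count, a function replacement, and the case where nothing matches.

```python
import re

text = 'a1 b22 c3'  # hypothetical input

# String replacement, limited to the first occurrence with count=1.
first_only = re.sub(r'\d+', '#', text, count=1)


def double(match):
    """Called once per match; returns the replacement text."""
    return str(int(match.group()) * 2)


# Function replacement applied to every non-overlapping match.
doubled = re.sub(r'\d+', double, text)

# No match: the input string comes back unchanged.
unchanged = re.sub(r'\d+', '#', 'abc')

print(first_only, doubled, unchanged, sep=' | ')
```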

Our use case of string replacement

Suppose we have a string containing a code statement that references a path. If the line ends with a todo comment, we want to replace this path with a default path and remove the todo comment.
We will illustrate this case in both ways: string replacement and function replacement.
In both cases, the search regex is the same. The difference lies in how the replacement itself is performed.
Here is the common part of the statement:

new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),

The general idea of this pattern:
– We specify a group for each part of the regex we want to process specifically in the replacement expression.
– Here, we don’t want to modify anything other than the path of the image, so what comes before it and what comes after it are each captured in a group (groups 1 and 3).
And because we want to remove the todo comment, we capture it in its own group (group 4).

1) With a string as replacement

text: str = 'open("tests/foo_image.png", 0, 0.8) # TODO '
new_path: str = 'tests/default.png'
#                                       1            2        3     4
new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
                  rf'\1{new_path}\3',
                  text)
print(f'new_line={new_line}')
#new_line=open("tests/default.png", 0, 0.8)

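In the replacement expression we can reference any group or leave it out, in whatever order we want: rf'\1{new_path}\3'. Here groups 2 and 4 are not referenced, so the old path and the todo comment are dropped.
In a backreference such as \1, the number is the number of the group to output.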
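One caveat with numeric backreferences: if the inserted text starts with a digit, \1 followed by that digit becomes ambiguous. The \g<1> form avoids this. The sketch below assumes a hypothetical replacement path that begins with a digit.

```python
import re

text = 'open("tests/foo_image.png", 0, 0.8) # TODO '
# Hypothetical path starting with a digit, to show why \g<1> is safer.
new_path = '2024/default.png'

pattern = re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$')

# \g<1> and \g<3> reference groups 1 and 3 without ambiguity.
new_line = pattern.sub(rf'\g<1>{new_path}\g<3>', text)

print(f'new_line={new_line}')
```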

2) With a function as replacement

def replaced_function(match: Match):
    print(f'result.groups()={match.groups()}')
    replacement: str = ''
    if match.group(1):
        print(f'match.group(1)={match.group(1)}')
        replacement += match.group(1) + '{new_name}'
    if match.group(3):
        print(f'match.group(3)={match.group(3)}')
        replacement += match.group(3).replace('\\', '/')
 
    print(f'replacement={replacement}')
    return replacement
 
 
text: str = 'open("tests/foo_image.png", 0, 0.8) # TODO '
new_path: str = 'tests/default.png'
new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
                  replaced_function,
                  text)
new_line = new_line.format(new_name=new_path)
print(f'new_line={new_line}')

Output:

result.groups()=('open("', 'tests/foo_image.png', '", 0, 0.8) ', '# TODO ')
match.group(1)=open("
match.group(3)=", 0, 0.8)
replacement=open("{new_name}", 0, 0.8)
new_line=open("tests/default.png", 0, 0.8)

Unlike the string replacement, this time we cannot concatenate the new path directly inside the replacement function, because re.sub() calls it with the match object as its only argument. So we use a trick: the string returned by our function contains a format placeholder:

if match.group(1):
        print(f'match.group(1)={match.group(1)}')
        replacement += match.group(1) + '{new_name}'

This way, we can fill in the placeholder after re.sub() returns:
new_line = new_line.format(new_name=new_path)
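As a possible alternative to the placeholder trick, the extra value can be bound before the function is handed to re.sub(), for example with functools.partial. This is only a sketch: the function name replace_path and its second parameter are hypothetical and not part of the original code.

```python
import re
from functools import partial


def replace_path(match, new_path):
    """Rebuild the line from groups 1 and 3 around the new path."""
    return match.group(1) + new_path + match.group(3)


text = 'open("tests/foo_image.png", 0, 0.8) # TODO '
pattern = re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$')

# partial() fixes new_path; re.sub() still passes only the match object.
new_line = pattern.sub(partial(replace_path, new_path='tests/default.png'), text)

print(f'new_line={new_line}')
```

With this variant, no second pass with format() is needed, so braces that may appear in the original line are left untouched.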

Other functions

re.escape(): Escape special characters in pattern. Useful to match an arbitrary literal string that may have regular expression metacharacters in it.
For example:

# raw strings keep the backslash literal and avoid an invalid-escape warning
regex = re.escape(r'\ a.*$')
print(f'regex={regex}')
# regex=\\\ a\.\*\$
result = re.search(regex, r'\ a.*$')
self.assertEqual((0, 6), (result.start(), result.end()))
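A typical use is searching for a literal that happens to contain metacharacters. In the sketch below (the literal and the text are hypothetical), the unescaped '+' is read as a quantifier, so the raw literal does not match itself, while the escaped version does.

```python
import re

literal = '1+1=2'          # hypothetical literal to look for
text = 'so 1+1=2 holds'    # hypothetical text

# Without escaping, '+' means "one or more of the previous character".
unescaped = re.search(literal, text)

# With escaping, every metacharacter is matched literally.
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.span() if escaped else None)
```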
