Raw regex

Python's raw strings (string literals prefixed with r) are very different from the Java approach, which doesn't have that capacity. In a raw string, backslashes are not processed as string escapes, so we can escape in a clever way: only when the regex itself needs it.
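
As a minimal sketch of that difference (the sample text and variable names are made up for illustration), here is the same regex written once as a classic string and once as a raw string:

```python
import re

# Classic string: every backslash meant for the regex engine must be doubled.
classic = '\\d+\\.\\d+'
# Raw string: backslashes reach the regex engine untouched.
raw = r'\d+\.\d+'

# Both literals produce exactly the same pattern text.
print(classic == raw)  # True

match = re.search(raw, 'version 3.12 released')
print(match.group(0))  # 3.12
```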

two main ways to use regex

    - create a pattern object and call its regex methods:



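import re
from typing import Match, Optional, Pattern

pattern: Pattern[str] = re.compile('animal')
result: Optional[Match[str]]
result = pattern.match('I am an animal')  # None: match() only looks at the start
result = pattern.match('I am a boy')  # None
result = pattern.search('I am a girl')  # None: 'animal' appears nowhere
# and so forth...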




    We go through the compile() function when we want to reuse a pattern: the regex is
    compiled only once and the same object is applied to many strings.
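
    A minimal sketch of that reuse, assuming a module-level constant and a helper function whose
    names (ANIMAL, lines_mentioning_animal) are invented for this example:

```python
import re
from typing import List

# Compiled once at import time, then reused for every line.
ANIMAL: re.Pattern = re.compile(r'animal')


def lines_mentioning_animal(lines: List[str]) -> List[str]:
    """Keep only the lines in which the compiled pattern is found."""
    return [line for line in lines if ANIMAL.search(line)]


found = lines_mentioning_animal(['I am an animal', 'I am a boy', 'animal farm'])
print(found)  # ['I am an animal', 'animal farm']
```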



    - call the module-level functions of re, passing at least 2 parameters: the pattern (a string
    or a compiled pattern) and the text to which we want to apply the pattern:



result = re.match('animal', 'I am an animal')
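
Both ways return None when nothing matches, so the result should be checked before it is used. A small sketch of that check (the describe function is hypothetical, not part of re):

```python
import re
from typing import Optional


def describe(text: str) -> str:
    """Report where 'animal' was found in text, or say that it was not."""
    result: Optional[re.Match] = re.search('animal', text)
    if result is None:
        return 'no match'
    return f'found at {result.span()}'


print(describe('I am an animal'))  # found at (8, 14)
print(describe('I am a boy'))      # no match
```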




Match instance functions

    Note: all the function descriptions and examples below rely on the same pattern and the
    same match:


pattern: Pattern[str] = re.compile(r'(.*)-(\d*)')
result: Match[str] = pattern.match('David-123')





    groups(): Return a tuple containing all the subgroups of the match.
    

    Example:


result: Match[str] = pattern.match('David-123')
print(result.groups())  # ('David', '123')





    group(group_number, ...): Returns one or more subgroups of the match.
    If there is a single argument, the result is a single string;
    else the result is a tuple with one item per argument.
    

    Example:



print(result.group(0))  # David-123
print(result.group(1))  # David
print(result.group(2))  # 123
print(result.group(1, 2))  # ('David', '123')





    __getitem__(): identical to group(group_number), but written with the index syntax
    result[group_number]. It returns a single string, not a list.

    Example:



print(result[0])  # David-123
print(result[1])  # David
print(result[2])  # 123





    Match.string: The string passed to match() or search().
    

    Example:



print(f'result.string={result.string}')  # result.string=David-123





    start([group_number]) and end([group_number]): Return
    the indices of the start and end of the substring matched by group; group defaults to zero
    (meaning the whole matched substring).

    Example:



matched_value = result.string[result.start(1):result.end(1)]
print(f'matched_value={matched_value}')  # David
entire_matched_value = result.string[result.start():result.end()]
print(f'entire_matched_value={entire_matched_value}')  # David-123




functions returning Matches

    re.match(pattern: Pattern[AnyStr], string: AnyStr,)
    

    Looks for a match only at the beginning of the string.
    

    example:



result: Optional[Match[str]]
# match() checks for a match only at the beginning of the string
result = re.match('animal', 'I am an animal')
self.assertIsNone(result)
result = re.match('animal', 'animal and so for')
self.assertEqual((0, 6), (result.start(), result.end()))





    re.search(pattern: Pattern[AnyStr], string: AnyStr,)
    

    Finds the first location where the regex matches. It looks for a match anywhere in the
    string.

    example:



result = re.search('animal', 'I am an animal')
self.assertEqual((8, 14), (result.start(), result.end()))
print(f'result={result}')
# search() can also be restricted to the beginning of the string by anchoring the pattern with ^
result = re.search('^animal', 'I am an animal')
self.assertIsNone(result)




Pattern.fullmatch(string[, pos[, endpos]])
Must match the whole string.

Example:



pattern_snake_text: Pattern = re.compile(r'([à-ÿa-z0-9]+_)+[à-ÿa-z0-9]+')
text = 'dump_traceback_later dummy'
match: Match = pattern_snake_text.match(text)
# match=re.Match object; span=(0, 20), match='dump_traceback_later'
match: Match = pattern_snake_text.fullmatch(text)
# match=None





    match versus search with multiline

    In MULTILINE mode, match() only matches at the beginning of the string, whereas
    using search() with a regular expression beginning with '^' will match at the
    beginning of each line.
    

    Example:



result: Optional[Match[str]]
result = re.match('X', 'A\nB\nX', re.MULTILINE)
self.assertIsNone(result)
# with search
result = re.search('^X', 'A\nB\nX', re.MULTILINE)
self.assertEqual((4, 5), (result.start(), result.end()))





    finditer(pattern, string,)

    Return an iterator of all matches.

    Example:



line = 'i_am_a_very_good_person'
#        1  4  6   11   16
matches: list[Match[str]] = list(re.finditer('_', line))
print(f'matches={matches}')
self.assertEqual((1, 2), (matches[0].start(), matches[0].end()))
self.assertEqual((4, 5), (matches[1].start(), matches[1].end()))
# and so forth...




functions returning the matching strings

    findall(pattern,string): Return a list of all non-overlapping matches in the
    string.

    - If there is no group: return a list of strings matching the whole pattern.

    - If there is exactly one group: return a list of strings matching that group.

    - If multiple groups are present: return a list of tuples of strings matching the groups





beware:


Empty matches are included in the result.




    Example with no group:



# \b matches the empty string at a word boundary (between a word character and a non-word character)
results: List[str] = re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest f')
self.assertEqual(['foot', 'fell', 'fastest', 'f'], results)





    Example with groups:



# \w matches a word character and \d a digit; each (...) group becomes one item of the tuple.
results: List[Tuple[str, str]] = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
self.assertEqual([('width', '20'), ('height', '10')], results)
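
    The two remaining points, the case with exactly one group and the empty matches, can be
    sketched like this (the input strings are made up for illustration):

```python
import re

# Exactly one group: only the text captured by that group is returned.
names = re.findall(r'(\w+)=\d+', 'set width=20 and height=10')
print(names)  # ['width', 'height']

# A pattern that can match the empty string yields empty matches too:
# 'x*' matches nothing at positions 0, 1 and 2 (the end) of 'ab'.
empties = re.findall(r'x*', 'ab')
print(empties)  # ['', '', '']
```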




Replace functions

    The re.sub() function performs replacements.
    

    

    Definition:



    def sub(pattern: Pattern[AnyStr],
        repl: (Match[AnyStr]) -> AnyStr,
        string: AnyStr,
        count: int = ...,
        flags: int | RegexFlag = ...) -> AnyStr




    Specification:
    


    Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in
    string by the replacement repl.

    If the pattern isn’t found, string is returned unchanged.

    repl can be a string or a function.

    - if repl is a string, any backslash escapes in it are processed.
    That is, \n is converted to a single newline character, \r is
    converted to a carriage return,
    and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as
    errors. Other unknown escapes such as \& are left alone.
    

    - If repl is a function, it is called for every non-overlapping occurrence
    of pattern.

    The function takes a single match object argument, and returns the replacement string.
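
    Before the full use case, here is a short sketch of both kinds of repl and of the count
    parameter. The sample strings and the double function are invented for this example:

```python
import re

# repl as a string: collapse every run of whitespace into one space.
collapsed = re.sub(r'\s+', ' ', 'a   b \t c')
print(collapsed)  # a b c


# repl as a function: it receives each match and returns its replacement.
def double(match: re.Match) -> str:
    return str(int(match.group(0)) * 2)


doubled = re.sub(r'\d+', double, 'width=20 height=10')
print(doubled)  # width=40 height=20

# count limits how many occurrences are replaced (here only the first one).
first_only = re.sub(r'\d+', '#', 'a1b2c3', count=1)
print(first_only)  # a#b2c3
```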
    

    


Our use case of string replacement


    Suppose we have a string that refers to a code statement referencing a path. If the line is
    terminated with a todo comment, we want to replace this path with a default path and remove the
    todo comment.
    

    We will illustrate this case with the two ways: string and function replacement.
    

    In both cases, the search regex is the same. The difference lies in the way the string
    replacement and the function replacement perform the actual replacement.
    

    Here is the common part of the statement:





new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),




    The general idea of this pattern:
    

    - We specify a group for each part of the regex we want to process specifically in the
    replacement
    expression.
    

    - Here, we don't want to modify anything other than the path of the image, so what comes
    before it and what comes after it are each captured in a group (groups 1 and 3).
    

    - Because we want to remove the todo comment, it also gets its own group (group 4), which the
    replacement simply leaves out.



    1) With a string as replacement




text: str = 'open("tests/foo_image.png", 0, 0.8) # TODO '
new_path: str = 'tests/default.png'
#                                       1            2        3     4
new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
                  rf'\1{new_path}\3',
                  text)
print(f'new_line={new_line}')
#new_line=open("tests/default.png", 0, 0.8)





    In the replacement expression we can include or omit each group, in whatever order we want:
    rf'\1{new_path}\3'.
    

    In \n, n is the number of the group to output.
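
    One caveat worth sketching: when the text that follows a group reference starts with a digit,
    \n becomes ambiguous, and re offers the \g<n> form for that case. The pattern and strings
    below are made up for illustration:

```python
import re

pattern = re.compile(r'(\w+)=(\d+)')

# Groups can be emitted in any order, or left out entirely.
swapped = pattern.sub(r'\2=\1', 'width=20')
print(swapped)  # 20=width

# r'\10' would be read as a reference to group 10, so \g<1> is used
# when a digit must follow group 1.
suffixed = pattern.sub(r'\g<1>0=\2', 'width=20')
print(suffixed)  # width0=20
```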



    2) With a function as replacement




def replaced_function(match: Match):
    print(f'result.groups()={match.groups()}')
    replacement: str = ''
    if match.group(1):
        print(f'match.group(1)={match.group(1)}')
        replacement += match.group(1) + '{new_name}'
    if match.group(3):
        print(f'match.group(3)={match.group(3)}')
        replacement += match.group(3).replace('\\', '/')
 
    print(f'replacement={replacement}')
    return replacement
 
 
text: str = 'open("tests/foo_image.png", 0, 0.8) # TODO '
new_path: str = 'tests/default.png'
new_line = re.sub(re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$'),
                  replaced_function,
                  text)
new_line = new_line.format(new_name=new_path)
print(f'new_line={new_line}')




    Output:




result.groups()=('open("', 'tests/foo_image.png', '", 0, 0.8) ', '# TODO ')
match.group(1)=open("
match.group(3)=", 0, 0.8)
replacement=open("{new_name}", 0, 0.8)
new_line=open("tests/default.png", 0, 0.8)




    Unlike with the string replacement, this time we cannot pass the new path value directly to
    the replacement function, because re.sub() calls it with the match object only. One
    workaround is to put a format placeholder in the string returned by our function:




    if match.group(1):
        print(f'match.group(1)={match.group(1)}')
        replacement += match.group(1) + '{new_name}'




    This way, we can fill in the placeholder once re.sub() has returned:
    

    new_line = new_line.format(new_name=new_path)
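
    An alternative to the placeholder, sketched here as one possible option rather than the
    approach used above: bind the extra argument with functools.partial before handing the
    function to re.sub(). The names PATH_PATTERN and replace_path are invented for this example.
    It also avoids calling format() on a line that might itself contain braces.

```python
import re
from functools import partial

PATH_PATTERN = re.compile(r'^(.*\([\'"])(.*/.*\.png)(.*)(# TODO.*)$')


def replace_path(match: re.Match, new_path: str) -> str:
    """Rebuild the line with new_path in place of group 2, dropping group 4."""
    return f'{match.group(1)}{new_path}{match.group(3)}'


text = 'open("tests/foo_image.png", 0, 0.8) # TODO '
# partial() fixes new_path; re.sub() then supplies only the match object.
new_line = PATH_PATTERN.sub(partial(replace_path, new_path='tests/default.png'), text)
print(f'new_line={new_line}')
# new_line=open("tests/default.png", 0, 0.8)
```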


Other functions


    re.escape(): Escape special characters in pattern. Useful to match an
    arbitrary literal string that may have regular expression metacharacters in it.

    For example:




# raw strings keep the backslash literal without an invalid-escape warning
regex = re.escape(r'\ a.*$')
print(f'regex={regex}')
# regex=\\\ a\.\*\$
result = re.search(regex, r'\ a.*$')
self.assertEqual((0, 6), (result.start(), result.end()))
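
    A typical use, sketched with a made-up user_text value, is embedding text we do not control
    inside a larger pattern:

```python
import re

# Hypothetical user input that happens to contain a regex metacharacter.
user_text = '1+1=2'

escaped = re.compile(rf'^{re.escape(user_text)}$')
unescaped = re.compile(rf'^{user_text}$')

print(bool(escaped.match('1+1=2')))    # True
print(bool(unescaped.match('1+1=2')))  # False: '+' acts as a quantifier
print(bool(unescaped.match('11=2')))   # True
```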