Read text files

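By default, Python's file-reading API preserves the line-break character (\n on Linux, for example).
In the examples below, "cleaned lines" means lines from which the trailing line-break character has been removed. Note that rstrip() with no argument also removes any other trailing whitespace.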
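
To make the distinction concrete, here is a minimal sketch. The helper name `strip_line_break` and the sample string are illustrative and not part of the original examples. It shows that `rstrip()` with no argument removes every trailing whitespace character, whereas `rstrip('\n')` removes only the line break.

```python
def strip_line_break(line: str) -> str:
    """Remove only the trailing line break, keeping other trailing whitespace."""
    return line.rstrip('\n')


raw_line = 'value\t \n'  # a line as returned when iterating over a text file

print(repr(raw_line.rstrip()))           # 'value'
print(repr(strip_line_break(raw_line)))  # 'value\t '
```

In text mode, open() translates platform line endings to \n by default, so stripping '\n' is normally enough even for files written on Windows.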

Read lines in memory:

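from typing import List  # required by the List[...] type hints in these examples


def read_lines_in_memory(filename: str) -> List[str]:
    # readlines() keeps the trailing line break of every line
    with open(filename) as file:
        lines = file.readlines()
    return lines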

Read cleaned lines in memory:

def read_cleaned_lines_in_memory(filename: str) -> List[str]:
    with open(filename) as file:
        lines = [line.rstrip() for line in file]
    return lines

Read cleaned lines line by line:

def read_cleaned_lines_line_by_line(filename: str):
    with open(filename) as file:
        for line in file:
            print(line.rstrip())

Read cleaned lines line by line and replace tab characters with a specific number of spaces:
Note: by default, the width at which a tab character is displayed is decided not by the application but by the system (the terminal or editor that shows it). Here we override this behavior by replacing each tab with a fixed number of spaces.

TAB_LENGTH = 4  # number of spaces per tab; example value, adjust as needed


def read_cleaned_lines_line_by_line_and_replace_tabulation_characters_by_specific_number_of_whitespaces(
        filename: str):
    with open(filename) as file:
        for line in file:
            new_line = line.rstrip().replace('\t', ' ' * TAB_LENGTH)
            print(new_line)
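
Replacing each tab with a fixed number of spaces is not the same as aligning text to tab stops, which is what terminals usually do. If column alignment matters, the built-in `str.expandtabs()` may be closer to what you want. A small comparison, where `TAB_LENGTH = 4` and the sample line are assumed values chosen for illustration:

```python
TAB_LENGTH = 4  # assumed tab width, for illustration only

line = 'ab\tc\n'

# Fixed replacement: every tab becomes exactly TAB_LENGTH spaces.
fixed = line.rstrip().replace('\t', ' ' * TAB_LENGTH)

# Tab stops: each tab pads up to the next multiple of TAB_LENGTH columns.
aligned = line.rstrip().expandtabs(TAB_LENGTH)

print(repr(fixed))    # 'ab    c'
print(repr(aligned))  # 'ab  c'
```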

Read cleaned lines line by line and remove blank lines:

def read_cleaned_lines_line_by_line_and_remove_blank_lines(filename: str):
    with open(filename) as file:
        for line in file:
            new_line = line.rstrip()
            # after rstrip(), a blank or whitespace-only line is an empty string
            if new_line:
                print(new_line)

Read specific cleaned lines line by line:

def read_specific_cleaned_lines_line_by_line(filename: str, line_numbers: List[int]):
    # shift the 1-based line numbers by -1 because enumerate() starts at 0
    line_numbers = [number - 1 for number in line_numbers]
    print(f'line_numbers={line_numbers}')
    with open(filename) as file:
        for index, line in enumerate(file):
            if index in line_numbers:
                print(line.rstrip())
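
The function above keeps scanning the file even after the last requested line has been printed. A possible variant is sketched here, assuming the caller wants the lines returned rather than printed. The function name `get_specific_cleaned_lines` is hypothetical. It numbers lines from 1 with `enumerate(file, start=1)` and stops as soon as nothing is left to find.

```python
from typing import Dict, Iterable


def get_specific_cleaned_lines(filename: str,
                               line_numbers: Iterable[int]) -> Dict[int, str]:
    """Return {line_number: cleaned_line} for the requested 1-based line numbers."""
    wanted = set(line_numbers)  # set membership tests are fast
    found: Dict[int, str] = {}
    if not wanted:
        return found
    last_wanted = max(wanted)
    with open(filename) as file:
        # start=1 makes the index match human-friendly line numbers
        for number, line in enumerate(file, start=1):
            if number in wanted:
                found[number] = line.rstrip()
            if number >= last_wanted:
                break  # nothing left to look for
    return found
```

Requested line numbers beyond the end of the file are simply absent from the returned dictionary.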

Common errors

Mismatch between the encoding used to open the file and the actual file encoding

Symptoms:
A UnicodeDecodeError is raised when the first line of the file is read.

File "C:\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4261: character maps to <undefined>

Solution:
Explicitly specify the encoding through the encoding parameter of the open() function.
For example, UTF-8 can be specified this way:

with open(f, encoding="utf8") as file:
    content = file.read()  # decoded as UTF-8 instead of the platform default
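
To see why the encoding matters, the following sketch reproduces the error on any platform by decoding UTF-8 bytes as cp1252. The sample text and the temporary file are made up for the demonstration. Byte 0x9d is the last byte of the UTF-8 encoding of the right double quotation mark (U+201D), and it has no mapping in cp1252.

```python
import os
import tempfile

# Write a small UTF-8 file containing a right double quotation mark (U+201D),
# whose UTF-8 encoding is the three bytes E2 80 9D.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('He said \u201chello\u201d\n')
path = tmp.name

try:
    # Wrong encoding: cp1252 cannot decode byte 0x9d.
    with open(path, encoding='cp1252') as file:
        file.read()
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print(f'decoding failed: {error}')

# Right encoding: the file is read without error.
with open(path, encoding='utf-8') as file:
    content = file.read()

os.remove(path)
print(repr(content))
```

If the real encoding is unknown, open() also accepts an errors argument (for example errors='replace') that substitutes undecodable bytes instead of raising, at the cost of losing data.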

Reading files one after the other without reinitializing the variable that stores the line content

Symptoms:
The problem shows up when opening or reading one of the files raises an exception early.
To debug such an issue, we usually look at the line currently being read.
Here, however, it looks inconsistent: the variable that references the last line read still holds a line from the previous file, which was read successfully.

Solution:
Before reading each file, reinitialize the variable that stores the current line content (and any other variable that references the last line read).
Example:

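    for f in files:
        # reset before opening, so a failure in open() cannot leave stale values
        i = 0
        line = None
        try:
            with open(f, encoding="utf8") as file:
                print(f'current read file={f}')
                for i, line in enumerate(file):
                    print(f'line({i})={line}')
        except (OSError, UnicodeDecodeError) as error:
            # line is None if nothing was read from this file
            print(f'error in file={f}, last line({i})={line!r}: {error}')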
