Read text files
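By default, Python's file-reading API preserves the line-break character (\n on Linux, for example).
In the examples below, « cleaned lines » means lines from which the trailing line-break characters have been removed. Note that rstrip() called without arguments also removes any other trailing whitespace.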
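To make the difference concrete, here is a minimal sketch that compares the raw lines with two ways of cleaning them. The sample text and the helper name `clean_variants` are illustrative assumptions, not part of the original examples.

```python
import io
from typing import List, Tuple


def clean_variants(text: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (raw lines, lines without the line break, lines without any trailing whitespace)."""
    # io.StringIO behaves like an opened text file, so no real file is needed here.
    raw = list(io.StringIO(text))
    newline_only = [line.rstrip('\n') for line in raw]
    all_trailing = [line.rstrip() for line in raw]
    return raw, newline_only, all_trailing


raw, newline_only, all_trailing = clean_variants('first  \nsecond\t\nthird')
print(raw)           # the line-break characters are still there
print(newline_only)  # trailing spaces and tabs are kept
print(all_trailing)  # trailing spaces and tabs are removed too
```

If trailing spaces or tabs are significant in your data, `rstrip('\n')` is the safer choice of the two.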
Read lines in memory:
    from typing import List


    def read_lines_in_memory(filename: str) -> List[str]:
        with open(filename) as file:
            lines = file.readlines()
        return lines
Read cleaned lines in memory:
    def read_cleaned_lines_in_memory(filename: str) -> List[str]:
        with open(filename) as file:
            lines = [line.rstrip() for line in file]
        return lines
Read cleaned lines line by line:
    def read_cleaned_lines_line_by_line(filename: str):
        with open(filename) as file:
            for line in file:
                print(line.rstrip())
Read cleaned lines line by line and replace tab characters with a fixed number of spaces:
Note: by default, the width at which a tab character is displayed is decided by the system (terminal or editor), not by the application. Here we override this behavior by replacing each tab with a fixed number of spaces.
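    TAB_LENGTH = 4  # assumed value; the original example does not show its definition


    def read_cleaned_lines_line_by_line_and_replace_tabulation_characters_by_specific_number_of_whitespaces(
            filename: str):
        with open(filename) as file:
            for line in file:
                # Each tab becomes exactly TAB_LENGTH spaces.
                new_line = line.rstrip().replace('\t', ' ' * TAB_LENGTH)
                print(new_line)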
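Replacing each tab with a fixed run of spaces does not align text to tab stops the way a terminal does. If column alignment matters, the built-in `str.expandtabs()` is a possible alternative. The sketch below compares the two; the value `TAB_LENGTH = 4` and the sample line are assumptions chosen for illustration.

```python
TAB_LENGTH = 4  # assumed value, for illustration only

line = 'ab\tc\tdef\n'

# Fixed replacement: every tab becomes exactly TAB_LENGTH spaces.
fixed = line.rstrip().replace('\t', ' ' * TAB_LENGTH)

# Tab-stop expansion: every tab pads up to the next multiple of TAB_LENGTH.
aligned = line.rstrip().expandtabs(TAB_LENGTH)

print(repr(fixed))
print(repr(aligned))
```

Which one is appropriate depends on whether the goal is a predictable substitution or visually aligned columns.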
Read cleaned lines line by line and remove blank lines:
    def read_cleaned_lines_line_by_line_and_remove_blank_lines(filename: str):
        with open(filename) as file:
            for line in file:
                new_line = line.rstrip()
                if new_line.strip() != '':
                    print(new_line)
Read specific cleaned lines line by line:
    def read_specific_cleaned_lines_line_by_line(filename: str, line_numbers: List[int]):
        # Line numbers are 1-based, but enumerate() starts at 0, so shift them by -1.
        line_numbers = [line - 1 for line in line_numbers]
        print(f'line_numbers={line_numbers}')
        with open(filename) as file:
            for index, line in enumerate(file):
                if index in line_numbers:
                    print(line.rstrip())
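For large files, the lookup above can be tightened in two ways: keep the wanted numbers in a set, and stop reading once the last wanted line has been seen. The following is a sketch under those assumptions; the function name `read_specific_cleaned_lines`, the sample file, and the choice to return the lines instead of printing them are illustrative.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def read_specific_cleaned_lines(filename: str, line_numbers: Iterable[int]) -> List[str]:
    """Return the cleaned lines whose 1-based numbers are in line_numbers, in file order."""
    wanted = set(line_numbers)
    if not wanted:
        return []
    last_wanted = max(wanted)
    selected = []
    with open(filename) as file:
        # start=1 makes enumerate() yield 1-based line numbers directly.
        for number, line in enumerate(file, start=1):
            if number in wanted:
                selected.append(line.rstrip())
            if number >= last_wanted:
                break  # nothing left to collect, so skip the rest of the file
    return selected


# Usage on a throwaway file created only for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / 'sample.txt'
    sample.write_text('a\nb\nc\nd\n')
    result = read_specific_cleaned_lines(str(sample), [3, 1])

print(result)
```

Lines come back in file order, whatever the order of the requested numbers.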
Common errors
Mismatch between the encoding used to open the file and the actual file encoding
Symptoms:
An exception is raised when the first line of the file is read.
    File "C:\Python39\lib\encodings\cp1252.py", line 23, in decode
      return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4261: character maps to <undefined>
Solution:
Specify the encoding explicitly through the encoding parameter of the open() function.
For example, we could specify the UTF-8 encoding in this way:
    with open(f, encoding="utf8") as file:
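The sketch below reproduces the mismatch in a self-contained way and shows two ways out: passing the right encoding or, when the real encoding is unknown, passing `errors='replace'` so that undecodable bytes become U+FFFD instead of raising. The sample text and file name are assumptions made for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'sample.txt'
    # U+201D (right double quotation mark) is E2 80 9D in UTF-8,
    # and byte 0x9D has no mapping in Python's cp1252 codec.
    path.write_bytes('say \u201dhi\u201d\n'.encode('utf-8'))

    # Wrong encoding: decoding fails on the first read.
    try:
        with open(path, encoding='cp1252') as file:
            file.readline()
        mismatch_error = None
    except UnicodeDecodeError as error:
        mismatch_error = error

    # Correct encoding: the text is decoded as written.
    with open(path, encoding='utf-8') as file:
        decoded = file.readline().rstrip()

    # Unknown encoding: errors='replace' swaps undecodable bytes for U+FFFD
    # instead of raising, at the cost of losing the original characters.
    with open(path, encoding='cp1252', errors='replace') as file:
        lossy = file.readline().rstrip()

# ascii() keeps the output printable on consoles that cannot display these characters.
print(type(mismatch_error).__name__)
print(ascii(decoded))
print(ascii(lossy))
```

`errors='replace'` hides the problem without fixing it, so it is best kept for diagnostics or for input where some loss is acceptable.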
We read files one after the other but we don’t reinitialize the variable used to
store the line content
Symptoms:
The problem occurs when opening or reading one of the files raises an exception early.
To debug such an issue, we generally look at the last line that was read.
But here that value is misleading: the variable referencing the last line read still holds the value from the previous file, which was read successfully.
Solution:
For each file, before any line is read, reinitialize the variable used to store the current line content (and any other variable that references the last line read).
Example:
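    for f in files:
        # Reset the per-file state first, so that a failure on this file
        # cannot report values left over from the previous one.
        i = 0
        line = None
        try:
            with open(f, encoding="utf8") as file:
                print(f'current read file={f}')
                for i, line in enumerate(file):
                    print(f'line({i})={line}')
        except (OSError, UnicodeDecodeError) as error:
            # Assumed handler: the original example stops before its except clause.
            print(f'error in {f}: last line({i})={line!r}, error={error}')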
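One way to make this kind of bug harder to write is to move the per-file work into a function, so that the line variables are local and recreated on every call. This is a sketch of that idea, not the approach of the example above; `read_file_report`, its return shape, and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def read_file_report(path: str) -> Tuple[int, Optional[str], Optional[Exception]]:
    """Read one file and return (lines read, last line read, error or None)."""
    count = 0
    last_line = None  # local, so it can never hold a line from another file
    try:
        with open(path, encoding='utf8') as file:
            for line in file:
                count += 1
                last_line = line.rstrip()
    except (OSError, UnicodeDecodeError) as error:
        return count, last_line, error
    return count, last_line, None


with tempfile.TemporaryDirectory() as folder:
    good = Path(folder) / 'good.txt'
    good.write_text('one\ntwo\n', encoding='utf8')
    missing = Path(folder) / 'missing.txt'  # deliberately never created
    reports = [read_file_report(str(p)) for p in (good, missing)]

for count, last_line, error in reports:
    print(count, last_line, type(error).__name__)
```

Because the second call starts from fresh local variables, the report for the missing file shows no last line instead of the last line of the previous file.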