[Python] Efficiently Removing Blank Lines from Text Data

Text data read from external files or obtained via web scraping often contains unnecessary blank lines. As a preliminary step in data processing, you may want to remove these empty lines (or lines containing only whitespace) to clean up the data.

In Python, you can implement this cleansing process concisely by combining string manipulation methods with list comprehensions. Here, I will introduce a code example for formatting text data that contains irregular blank lines.

Table of Contents

  1. Logic for Removing Blank Lines
  2. Implementation Example: Formatting Address Data
  3. Code Explanation

Logic for Removing Blank Lines

The basic approach to removing blank lines is as follows:

  1. Line Splitting: Convert the entire text into a list of lines.
  2. Filtering: For each line, check whether anything remains after stripping leading and trailing whitespace, and keep only the lines that pass this check.
  3. Rejoining: Join the extracted lines back together.
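The three steps above can be sketched as a single helper function. This is a minimal sketch, not a definitive implementation: the name remove_blank_lines and the sample string are hypothetical and chosen only for illustration.

```python
def remove_blank_lines(text: str) -> str:
    """Return text with empty and whitespace-only lines removed."""
    # 1. splitlines() splits the text into lines
    # 2. the condition keeps lines that are non-empty after strip()
    # 3. "\n".join() puts the remaining lines back together
    return "\n".join(line for line in text.splitlines() if line.strip())


# Hypothetical sample: blank lines, a spaces-only line, and a tab-only line
sample = "first\n\n   \nsecond\n\t\nthird\n"
result = remove_blank_lines(sample)
print(result)
```

Note that the result has no trailing newline, and any \r\n line endings in the input are normalized to \n by the final join.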

Implementation Example: Formatting Address Data

The following code is an example of compacting and formatting address data that contains irregular blank lines, such as those caused by manual entry.

def main():
    # Text data containing irregular blank lines or lines with only spaces
    raw_address_data = """
Tokyo, Shinjuku
    
Osaka, Umeda
   
   
Nagoya, Sakae

Fukuoka, Hakata
"""

    # 1. Split the string into lines
    # splitlines() handles newline codes (\n, \r\n) appropriately
    lines = raw_address_data.splitlines()

    # 2. Create a new list excluding blank lines
    # line.strip() returns a copy with leading and trailing whitespace removed.
    # A blank or whitespace-only line becomes "" (falsy) and is dropped;
    # any other line is kept unchanged.
    clean_lines = [line for line in lines if line.strip()]

    # 3. Join the formatted lines with newline codes
    formatted_text = "\n".join(clean_lines)

    # Output result
    print("--- Data Before Formatting (For Check) ---")
    print(f"'{raw_address_data}'")
    print("\n--- Data After Formatting ---")
    print(formatted_text)

if __name__ == "__main__":
    main()

Code Explanation

Judgment using line.strip()

The strip() method removes all whitespace characters (spaces, tabs, newlines, etc.) from the beginning and end of a string.

  • Line with text: " Tokyo " -> "Tokyo" (non-empty, so truthy)
  • Empty line or line with only spaces: " " -> "" (empty, so falsy)

Because Python treats an empty string as false in a condition, writing if line.strip(): filters out lines that contain no substantial data.
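To make the truthiness check concrete, the following sketch prints what strip() returns for a few inputs. The candidates list is a hypothetical sample; it includes a full-width space (U+3000), which strip() also treats as whitespace and which can appear in manually entered Japanese text.

```python
# Hypothetical sample lines; the last item is a full-width (ideographic) space
candidates = ["  Tokyo  ", "", "   ", "\t", "\u3000"]

for line in candidates:
    # bool() makes explicit the truthiness that `if line.strip():` relies on
    print(repr(line), "->", repr(line.strip()), bool(line.strip()))

# The filter only tests the stripped copy; the kept line itself is unchanged
kept = [line for line in candidates if line.strip()]
print(kept)
```

The kept line still carries its original leading and trailing spaces. If you also want to trim each line, one option is to collect line.strip() instead of line in the comprehension.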

Advantage of splitlines()

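If you use split('\n'), the list ends with an extra empty string whenever the text ends with a newline, and for text with Windows-style line endings (\r\n) each line keeps a trailing \r. The blank-line filter would discard the extra empty string anyway, but the trailing \r would remain in the lines that are kept. splitlines() recognizes \n, \r\n, and \r (as well as several less common line boundaries) and does not produce a trailing empty string, making it more suitable for line-by-line processing.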
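The difference is easy to see side by side. The following sketch uses a hypothetical two-line string with Windows-style line endings.

```python
# Hypothetical sample with \r\n line endings and a trailing newline
text = "Tokyo\r\nOsaka\r\n"

by_split = text.split("\n")
by_splitlines = text.splitlines()

print(by_split)       # ['Tokyo\r', 'Osaka\r', '']
print(by_splitlines)  # ['Tokyo', 'Osaka']
```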

This process is useful for many text processing tasks, such as pre-processing for CSV file loading or normalizing user input text.
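As one illustration of the CSV use case, the sketch below filters blank lines before passing the text to the standard csv module. The sample data is hypothetical, and the approach assumes that no quoted field contains an embedded newline, since splitting on line boundaries would break such a field apart.

```python
import csv

# Hypothetical CSV text with an empty line and a spaces-only line
raw_csv = "city,area\n\nTokyo,Shinjuku\n   \nOsaka,Umeda\n"

# Drop blank and whitespace-only lines before handing the rows to csv.reader,
# which accepts any iterable of strings
non_blank = [line for line in raw_csv.splitlines() if line.strip()]
rows = list(csv.reader(non_blank))
print(rows)
```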
