Text data read from external files or obtained via web scraping often contains unnecessary blank lines. As a preliminary step in data processing, you may want to remove these empty lines (or lines containing only whitespace) to clean up the data.
In Python, you can implement this cleansing process concisely by combining string manipulation methods with list comprehensions. Here, I will introduce a code example for formatting text data that contains irregular blank lines.
Table of Contents
- Logic for Removing Blank Lines
- Implementation Example: Formatting Address Data
- Code Explanation
Logic for Removing Blank Lines
The basic approach to removing blank lines is as follows:
- Line Splitting: Convert the entire text into a list of lines.
- Judgment and Extraction: For each line, check whether any characters remain after stripping whitespace, and keep only the lines that pass this check.
- Rejoining: Join the extracted lines back together.
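The three steps above can be sketched as a small reusable function. The name remove_blank_lines and the sample string are hypothetical, chosen only for illustration.

```python
def remove_blank_lines(text: str) -> str:
    """Return text with empty and whitespace-only lines removed."""
    lines = text.splitlines()                        # 1. line splitting
    kept = [line for line in lines if line.strip()]  # 2. judgment and extraction
    return "\n".join(kept)                           # 3. rejoining


# Hypothetical sample: one empty line and one line of spaces between two entries
sample = "first\n\n   \nsecond\n"
print(remove_blank_lines(sample))
```

Note that the result has no trailing newline, because join() only places the separator between lines. Append one yourself if the output is written back to a file that expects it.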
Implementation Example: Formatting Address Data
The following code is an example of compacting and formatting address data that contains irregular blank lines, such as those caused by manual entry.
def main():
    # Text data containing irregular blank lines or lines with only spaces
    raw_address_data = """
Tokyo, Shinjuku

Osaka, Umeda
   

Nagoya, Sakae

Fukuoka, Hakata
"""

    # 1. Split the string into lines
    # splitlines() handles newline codes (\n, \r\n) appropriately
    lines = raw_address_data.splitlines()

    # 2. Create a new list excluding blank lines
    # line.strip() removes leading and trailing whitespace.
    # Keep a line only if something remains (truthy) after removing whitespace.
    clean_lines = [line for line in lines if line.strip()]

    # 3. Join the formatted lines with newline codes
    formatted_text = "\n".join(clean_lines)

    # Output result
    print("--- Data Before Formatting (For Check) ---")
    print(f"'{raw_address_data}'")
    print("\n--- Data After Formatting ---")
    print(formatted_text)


if __name__ == "__main__":
    main()
Code Explanation
Judgment using line.strip()
The strip() method removes all whitespace characters (spaces, tabs, newlines, etc.) from the beginning and end of a string.
- Line with text: " Tokyo " -> "Tokyo" (evaluated as True)
- Empty line or line with only spaces: " " -> "" (evaluated as False)
By utilizing this property and writing if line.strip():, you can filter out lines that contain no substantial data.
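As a quick check of this behavior, the sketch below runs strip() over a few made-up sample lines. The full-width space (\u3000) is included on the assumption that Japanese address data may contain it. With no arguments, strip() treats it as whitespace too.

```python
# Hypothetical sample lines; \u3000 is the full-width (ideographic) space
samples = [" Tokyo ", "", "   ", "\t", "\u3000"]

for line in samples:
    stripped = line.strip()
    # bool() makes explicit the check that `if line.strip():` performs implicitly
    verdict = "keep" if bool(stripped) else "drop"
    print(repr(line), "->", repr(stripped), verdict)

# Only the first sample survives the filter
results = [bool(line.strip()) for line in samples]
```

The filter only decides which lines to keep. The kept lines are not modified, so " Tokyo " keeps its surrounding spaces unless you store line.strip() instead of line.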
Advantage of splitlines()
If you use split('\n'), unintended behavior may occur: an empty string is added to the end of the list when the text ends with a newline, and a stray \r is left at the end of each line when the text uses \r\n line endings. splitlines() handles these edge cases, making it better suited to line-by-line processing.
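The difference is easy to see on a short string with mixed line endings and a trailing newline. The sample text here is made up for the comparison.

```python
text = "Tokyo\r\nOsaka\n"  # mixed line endings plus a trailing newline

by_split = text.split("\n")
by_splitlines = text.splitlines()

print(by_split)       # ['Tokyo\r', 'Osaka', '']
print(by_splitlines)  # ['Tokyo', 'Osaka']
```

The if line.strip() filter would drop the trailing empty string either way, but it would keep 'Tokyo\r' with the carriage return still attached. That is one reason to prefer splitlines() here.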
This process is useful for many text processing tasks, such as pre-processing for CSV file loading or normalizing user input text.
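As one possible shape for the CSV case, the sketch below filters blank lines lazily while reading and passes the rest to csv.reader. The io.StringIO object stands in for an opened file, and its contents are invented for illustration.

```python
import csv
import io

# Stand-in for an opened CSV file; the contents are made up for illustration
csv_file = io.StringIO("name,city\n\nSato,Tokyo\n   \nSuzuki,Osaka\n")

# Filter while iterating, so the raw lines never have to be collected in a list
non_blank = (line for line in csv_file if line.strip())

rows = list(csv.reader(non_blank))
print(rows)
```

This assumes no quoted field spans multiple lines. If a quoted value legitimately contains a blank line, filtering before parsing would alter it, and it is safer to parse first and drop empty rows afterwards.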
