[Python] How to Extract and Unpack Tar Files: Using the tarfile Module

To handle archive files such as .tar, .tar.gz, and .tgz in Python, use the standard library tarfile module.

This format is frequently used for server management and data exchange in Linux environments.

目次

What is a Tar File?

“Tar (Tape Archive)” is an archive format that groups multiple files into a single file.

The Tar format itself does not have data compression capabilities, so it is usually combined with compression formats like gzip or bzip2. Therefore, extensions are commonly .tar.gz or .tar.bz2 rather than just .tar.

List of Open Function Modes

The main read modes you can specify when opening a file with tarfile.open() are as follows.

Usually, ‘r:*’ (automatic detection) is used, but you can also specify the mode explicitly.

Mode StringMeaning / Usage
'r' or 'r:*'Automatic detection read (Recommended). Automatically recognizes compression formats (gzip, bzip2, lzma).
'r:gz'Reads gzip compressed files (extensions .tar.gz, .tgz, etc.).
'r:bz2'Reads bzip2 compressed files (extensions .tar.bz2, .tbz, etc.).
'r:xz'Reads lzma (xz) compressed files (extensions .tar.xz, etc.).

Implementation Example: Extracting Server Logs (.tar.gz)

In this example, we will implement a process to extract a backed-up web server log file named web_logs_backup.tar.gz into a directory called log_analysis.

Source Code

import tarfile
import os

# 1. File to extract and destination directory
archive_path = "web_logs_backup.tar.gz"
extract_to = "log_analysis"

print(f"--- Extracting archive '{archive_path}' ---")

# Create the destination directory if it doesn't exist (to prevent errors)
if not os.path.exists(extract_to):
    os.makedirs(extract_to)

# 2. Open the tar file
# mode="r:gz" explicitly specifies gzip compression ("r" also works for auto-detection)
# encoding="utf-8" specifies the encoding for filenames inside the tar
try:
    with tarfile.open(archive_path, mode="r:gz") as tar:
        
        # Display contained filenames (for verification)
        print("Files included:")
        for member in tar.getnames():
            print(f" - {member}")
        
        # 3. Extract all files to the specified directory
        # filter='data' is a recommended security setting for Python 3.12+
        # (To prevent directory traversal attacks, etc.)
        if hasattr(tarfile, 'data_filter'):
             tar.extractall(path=extract_to, filter='data')
        else:
             # For older versions of Python
             tar.extractall(path=extract_to)
            
    print(f"\nExtraction complete: {extract_to}/")

except FileNotFoundError:
    print(f"Error: File '{archive_path}' not found.")
except tarfile.ReadError:
    print("Error: Invalid or corrupted file format.")

Execution Result

--- Extracting archive 'web_logs_backup.tar.gz' ---
Files included:
 - access.log
 - error.log
 - app/debug.log

Extraction complete: log_analysis/

Explanation

tarfile.open(name, mode)

This opens the archive. Using the with block ensures the file is automatically closed after processing is finished.

extractall(path)

This extracts all files within the archive to the specified path.

Security (filter=’data’)

Starting from Python 3.12, specifying the filter argument is recommended to prevent carelessly extracting tar files containing malicious paths (such as ../../etc/passwd). For general data files, specifying filter='data' is safe.

よかったらシェアしてね!
  • URLをコピーしました!
  • URLをコピーしました!

この記事を書いた人

私が勉強したこと、実践したこと、してることを書いているブログです。
主に資産運用について書いていたのですが、
最近はプログラミングに興味があるので、今はそればっかりです。

目次