[Python] Creating Histograms and Visualizing Data Distribution with Matplotlib

Overview

This recipe uses Matplotlib’s hist method to create histograms (frequency distribution plots). It shows how to draw readable histograms by adjusting the number of bins, colors, borders, and transparency, so that the spread and skew of the data are easy to see.

Specifications (Input/Output)

  • Input: 1D numerical data array (list or NumPy array), bin settings, and style settings.
  • Output: A histogram plot (frequency or probability density).
  • Requirements: matplotlib and numpy libraries must be installed.

Basic Usage

import matplotlib.pyplot as plt
import numpy as np

# Prepare data (e.g., random numbers following a normal distribution)
data = np.random.randn(1000)

fig, ax = plt.subplots()
# The simplest histogram
ax.hist(data)

plt.show()
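Besides drawing the bars, ax.hist() also returns the bin counts, the bin edges, and the bar objects, which is handy when you want the numbers behind the plot. The following is a minimal sketch; the Agg backend line is an assumption made only so that it runs without a display, and the seeded generator is used just to keep the sample reproducible.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Reproducible sample of 1000 standard-normal values
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

fig, ax = plt.subplots()
# hist() returns (counts per bin, bin edges, bar container)
counts, bin_edges, patches = ax.hist(data)

# With the default of 10 bins there are 10 counts and 11 edges
print(len(counts), len(bin_edges))  # 10 11
# Every sample falls into exactly one bin
print(int(counts.sum()))            # 1000
```

There is always one more edge than there are bins, because each bin is bounded by a left and a right edge.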

Full Code Example

This is a complete code example that generates normally distributed data with a mean of 50 and a standard deviation of 10, and draws an easy-to-read histogram with clear bar boundaries.

import matplotlib.pyplot as plt
import numpy as np

def main():
    # 1. Generate data (e.g., 2000 product weight samples)
    # Mean=50.0, Standard Deviation=10.0
    np.random.seed(42) # Fix seed for reproducibility
    weights = np.random.normal(50.0, 10.0, 2000)

    # 2. Create the drawing area
    fig, ax = plt.subplots(figsize=(8, 5))

    # 3. Draw histogram
    # density=False (used here) plots raw counts; density=True would plot "probability density" (total area sums to 1)
    ax.hist(
        x=weights,              # Data
        bins=25,                # Number of bars (bins)
        density=False,          # True for probability density, False for frequency (count)
        color='#00AAFF',        # Fill color of the bars
        ec='black',             # Border color (abbreviation of edgecolor)
        alpha=0.6,              # Transparency
        label='Samples'
    )

    # 4. Set labels and title
    ax.set_title("Product Weight Distribution")
    ax.set_xlabel("Weight (g)")
    ax.set_ylabel("Frequency") # "Probability Density" if density=True

    # Display the mean value with a vertical line (optional)
    mean_val = np.mean(weights)
    ax.axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.1f}')

    ax.legend()
    ax.grid(axis='y', alpha=0.5)

    # 5. Display
    plt.show()

if __name__ == "__main__":
    main()
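The density option in the example above can be checked numerically. The sketch below uses np.histogram, which Matplotlib's hist relies on for binning, with the same mean, standard deviation, sample size, and bin count. The seeded generator is an assumption for reproducibility and produces different samples from the np.random.seed(42) call above.

```python
import numpy as np

# Same shape of data as the full example: mean 50, std 10, 2000 samples
rng = np.random.default_rng(42)
weights = rng.normal(50.0, 10.0, 2000)

# density=True returns bar heights as probability densities
density, edges = np.histogram(weights, bins=25, density=True)

# Area of each bar = height * bin width
widths = np.diff(edges)
area = float(np.sum(density * widths))

# The heights alone do not sum to 1 (unless every bin has width 1)
height_sum = float(np.sum(density))

print(f"total area: {area:.6f}")
print(f"sum of heights: {height_sum:.6f}")
```

The total area comes out as 1 up to floating-point rounding, while the sum of the heights depends on the bin width.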

Customization Points

These are the main parameters that determine the shape and style of the histogram in ax.hist().

Parameter Name | Description | Setting Example
x | Input data for creating the histogram (1D array). Required. | data
bins | Number of bars (bins), or a list of bin boundary values. | 20 (int), [0, 10, 20] (list)
density | If True, the vertical axis represents probability density (integrates to 1). Default is False (frequency). | True, False
color | Fill color of the bars. | 'skyblue', '#FF5733'
ec (edgecolor) | Border color of the bars. Specifying this makes the boundaries between bars clear. | 'black', 'white'
alpha | Transparency (0.0 to 1.0). Useful when overlaying multiple distributions. | 0.5

  • Importance of bins: With too few bins the plot is too coarse and hides the shape of the distribution. With too many, each bin holds only a few points and the plot becomes noisy. Adjust the number to the amount of data (a common rule of thumb is the square root of the number of data points).
  • ec (edgecolor): By default, Matplotlib draws the bars without borders, so adjacent bars of the same color look connected. Specifying something like ec='black' is recommended to make the boundaries clear.
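The square-root rule of thumb mentioned above can be sketched as follows. The sample data and the variable name n_bins are illustrative assumptions. The resulting value can be passed to ax.hist() as bins=n_bins together with ec='black'.

```python
import math
import numpy as np

# Illustrative data: 2000 samples, as in the full example
rng = np.random.default_rng(0)
data = rng.normal(50.0, 10.0, 2000)

# Square-root rule: number of bins ~ sqrt(number of data points), rounded up
n_bins = math.ceil(math.sqrt(len(data)))
print(n_bins)  # 45, since sqrt(2000) is about 44.7

# Bin the data with that count (ax.hist(data, bins=n_bins, ec='black') would plot it)
counts, edges = np.histogram(data, bins=n_bins)
print(len(counts), int(counts.sum()))  # 45 2000
```

Treat the result as a starting point and adjust it by eye, since the best count also depends on the shape of the data.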

Important Notes

  • Meaning of density=True: If you set density=True, the Y-axis value becomes “probability density”, and the total area of all bars sums to 1. Note that the sum of the Y-axis heights is not 1.
  • Missing data: If the data contains NaN (missing values), it may cause errors or unexpected results, depending on the library versions in use. We recommend removing them beforehand using pandas or NumPy.
  • Automatic bins setting: Specifying bins='auto' lets NumPy choose the number of bins from the data using its built-in estimators, instead of a fixed count.
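The last two notes can be combined into a small preprocessing step: drop the NaN values with a boolean mask, then let NumPy pick the bin edges. This is a minimal sketch with made-up sample values. np.histogram_bin_edges is used here only to inspect what bins='auto' would choose, and the same edges can be passed to ax.hist() through bins.

```python
import numpy as np

# Made-up measurements containing missing values
raw = np.array([48.2, 51.7, np.nan, 49.9, 53.1, np.nan, 50.4])

# Keep only the entries that are not NaN
clean = raw[~np.isnan(raw)]
print(len(raw), len(clean))  # 7 5

# Let NumPy choose the bin edges from the cleaned data
edges = np.histogram_bin_edges(clean, bins="auto")
counts, _ = np.histogram(clean, bins=edges)

# All remaining samples are counted
print(int(counts.sum()))  # 5
```

With pandas, calling dropna() on a Series serves the same purpose as the boolean mask.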

Variations

Overlapping Two Distributions for Comparison

This example uses transparency (alpha) to compare the distributions of two different datasets on the same axes.

import matplotlib.pyplot as plt
import numpy as np

def compare_histograms():
    # Generate two datasets with different means
    data_a = np.random.normal(40, 5, 1000)
    data_b = np.random.normal(55, 10, 1000)

    fig, ax = plt.subplots()

    # Specify alpha to make the overlap visible
    ax.hist(data_a, bins=30, alpha=0.5, label='Group A', color='blue', ec='blue')
    ax.hist(data_b, bins=30, alpha=0.5, label='Group B', color='orange', ec='orange')

    ax.set_title("Comparison of Two Groups")
    ax.legend()
    plt.show()

if __name__ == "__main__":
    compare_histograms()
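In the listing above, bins=30 is applied to each dataset separately, so the two groups end up with different bin widths and their bar heights are not directly comparable. One possible refinement, sketched below, is to compute a single set of bin edges from the combined data and pass it to both calls. The name shared_edges and the Agg backend line are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data_a = rng.normal(40, 5, 1000)
data_b = rng.normal(55, 10, 1000)

# One set of 30 bins spanning the range of both datasets
combined = np.concatenate([data_a, data_b])
shared_edges = np.histogram_bin_edges(combined, bins=30)

fig, ax = plt.subplots()
# Passing the same edges gives both groups identical bin widths
counts_a, edges_a, _ = ax.hist(data_a, bins=shared_edges, alpha=0.5,
                               label='Group A', color='blue')
counts_b, edges_b, _ = ax.hist(data_b, bins=shared_edges, alpha=0.5,
                               label='Group B', color='orange')

ax.set_title("Comparison of Two Groups (shared bins)")
ax.legend()
```

Because both histograms use the same bins, a taller bar in one group means more samples in that exact interval.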

Summary

Histograms are a basic tool for understanding the overall shape of your data. Beyond simply drawing one, adjusting the number of bins and setting ec (border) and alpha (transparency) lets you communicate the characteristics of the data more accurately.

