Overview
This recipe uses Matplotlib’s hist method to create histograms (frequency distribution plots). It explains how to adjust the number of bins, colors, borders, and transparency so that the spread and skew of the data are easy to read.
Specifications (Input/Output)
- Input: 1D numerical data array (list or NumPy array), bin settings, and style settings.
- Output: A histogram plot (frequency or probability density).
- Requirements: the matplotlib and numpy libraries must be installed.
Basic Usage
import matplotlib.pyplot as plt
import numpy as np
# Prepare data (e.g., random numbers following a normal distribution)
data = np.random.randn(1000)
fig, ax = plt.subplots()
# The simplest histogram
ax.hist(data)
plt.show()
Full Code Example
This complete example generates normally distributed data with a mean of 50 and a standard deviation of 10, and draws an easy-to-read histogram with clear bar boundaries.
import matplotlib.pyplot as plt
import numpy as np
def main():
    # 1. Generate data (e.g., 2000 product weight samples)
    # Mean=50.0, Standard Deviation=10.0
    np.random.seed(42)  # Fix seed for reproducibility
    weights = np.random.normal(50.0, 10.0, 2000)

    # 2. Create the drawing area
    fig, ax = plt.subplots(figsize=(8, 5))

    # 3. Draw histogram
    # Setting density=True makes the Y-axis "probability density" (total area sums to 1)
    ax.hist(
        x=weights,        # Data
        bins=25,          # Number of bars (bins)
        density=False,    # True for probability density, False for frequency (count)
        color='#00AAFF',  # Fill color of the bars
        ec='black',       # Border color (alias of edgecolor)
        alpha=0.6,        # Transparency
        label='Samples'
    )

    # 4. Set labels and title
    ax.set_title("Product Weight Distribution")
    ax.set_xlabel("Weight (g)")
    ax.set_ylabel("Frequency")  # "Probability Density" if density=True

    # Display the mean value with a vertical line (optional)
    mean_val = np.mean(weights)
    ax.axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.1f}')
    ax.legend()
    ax.grid(axis='y', alpha=0.5)

    # 5. Display
    plt.show()


if __name__ == "__main__":
    main()
Customization Points
These are the main parameters that determine the shape and style of the histogram in ax.hist().
| Parameter Name | Description | Setting Example |
| --- | --- | --- |
| x | Input data for creating the histogram (1D array). Required. | data |
| bins | Number of bars (bins), or a list of bin boundary values. | 20 (int), [0, 10, 20] (list) |
| density | If True, the vertical axis represents probability density (integrates to 1). Default is False (frequency). | True, False |
| color | Fill color of the bars. | 'skyblue', '#FF5733' |
| ec (edgecolor) | Border color of the bars. Specifying this makes the boundaries between bars clear. | 'black', 'white' |
| alpha | Transparency (0.0 to 1.0). Useful when overlaying multiple distributions. | 0.5 |
- Importance of bins: With too few bins, the histogram is too coarse and hides the shape of the distribution. With too many, each bin holds only a few points and the plot looks sparse and noisy. Adjust the count to the amount of data (a rule of thumb is the square root of the number of data points).
- ec (edgecolor): By default, Matplotlib draws the bars without borders, so adjacent bars of the same color look connected. Specifying something like ec='black' is recommended to make the boundaries clear.
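As a rough sketch of the square-root rule of thumb mentioned above, the snippet below derives a bin count from the sample size and passes it to ax.hist(). The helper name sqrt_rule_bins and the sample data are assumptions made for illustration; NumPy also accepts bins='sqrt' for the same rule.

```python
import matplotlib.pyplot as plt
import numpy as np


def sqrt_rule_bins(data):
    """Hypothetical helper: bin count from the square-root rule, ceil(sqrt(n))."""
    n = len(data)
    return max(1, int(np.ceil(np.sqrt(n))))


# Illustrative data: 2000 samples from a normal distribution
rng = np.random.default_rng(0)
data = rng.normal(50.0, 10.0, 2000)

n_bins = sqrt_rule_bins(data)  # ceil(sqrt(2000)) = 45

# NumPy's built-in 'sqrt' estimator applies the same rule
edges = np.histogram_bin_edges(data, bins='sqrt')

fig, ax = plt.subplots()
counts, _, _ = ax.hist(data, bins=n_bins, ec='black')
ax.set_title(f"bins = {n_bins} (square-root rule)")
```

Call plt.show() afterwards to display the figure. The rule is only a starting point, so it is worth trying a few nearby values and comparing the result by eye.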
Important Notes
- Meaning of density=True: If you set density=True, the Y-axis value becomes “probability density”, and the total area of all bars sums to 1. Note that the sum of the Y-axis heights is not 1.
- Missing Data: If the data contains NaN (missing values), it may cause errors or unexpected behavior. We recommend removing them beforehand using pandas or NumPy.
- Automatic bins setting: Specifying bins='auto' lets NumPy choose the number of bins automatically from the data, using its built-in bin-width estimators.
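The three notes above can be checked with a small sketch. It assumes illustrative data with a few NaN values injected on purpose, removes them with np.isnan, draws the histogram with bins='auto' and density=True, and then confirms numerically that the bar areas (height times width) sum to 1 while the heights alone do not.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: 1000 samples, with 10 values replaced by NaN on purpose
rng = np.random.default_rng(42)
raw = rng.normal(50.0, 10.0, 1000)
raw[::100] = np.nan

# Remove missing values before plotting
clean = raw[~np.isnan(raw)]

fig, ax = plt.subplots()
# ax.hist returns the bar heights, the bin edges, and the bar patches
heights, edges, _ = ax.hist(clean, bins='auto', density=True, ec='black')

widths = np.diff(edges)                # width of each bin
total_area = np.sum(heights * widths)  # area of all bars: approximately 1.0
sum_heights = np.sum(heights)          # sum of heights: generally not 1

print(f"bins chosen: {len(heights)}")
print(f"total area: {total_area:.3f}, sum of heights: {sum_heights:.3f}")
```

With pandas, the equivalent cleanup would be calling dropna() on the Series before passing it to ax.hist().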
Variations
Overlapping Two Distributions for Comparison
This example compares the distributions of two datasets by drawing them semi-transparently (alpha) on the same axes.
import matplotlib.pyplot as plt
import numpy as np
def compare_histograms():
    # Generate two datasets with different means
    data_a = np.random.normal(40, 5, 1000)
    data_b = np.random.normal(55, 10, 1000)

    fig, ax = plt.subplots()

    # Specify alpha to make the overlap visible
    ax.hist(data_a, bins=30, alpha=0.5, label='Group A', color='blue', ec='blue')
    ax.hist(data_b, bins=30, alpha=0.5, label='Group B', color='orange', ec='orange')

    ax.set_title("Comparison of Two Groups")
    ax.legend()
    plt.show()


if __name__ == "__main__":
    compare_histograms()
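One detail worth noting about the comparison above: passing bins=30 to each call computes the bin edges separately for each dataset, so the two groups end up with bars of different widths. A possible refinement, sketched below under the assumption that equal-width bins are wanted for a like-for-like comparison, is to compute one set of edges from the combined data with np.histogram_bin_edges and reuse it for both calls. The variable name shared_edges is illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Same two illustrative groups as in the comparison example
rng = np.random.default_rng(0)
data_a = rng.normal(40, 5, 1000)
data_b = rng.normal(55, 10, 1000)

# Compute 30 equal-width bins spanning both datasets
shared_edges = np.histogram_bin_edges(np.concatenate([data_a, data_b]), bins=30)

fig, ax = plt.subplots()
# Passing the same edge array to both calls keeps the bars aligned
counts_a, edges_a, _ = ax.hist(data_a, bins=shared_edges, alpha=0.5,
                               label='Group A', color='blue')
counts_b, edges_b, _ = ax.hist(data_b, bins=shared_edges, alpha=0.5,
                               label='Group B', color='orange')
ax.set_title("Comparison of Two Groups (shared bins)")
ax.legend()
```

Call plt.show() afterwards to display the figure. With shared edges, bar heights in the two groups refer to the same intervals, which makes the overlap easier to interpret.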
Summary
Histograms are a basic tool for understanding the overall picture of your data. Beyond simply drawing one, adjusting the number of bins and setting ec (border) and alpha (transparency) lets you communicate the characteristics of the data more accurately.
