statistics - Statistical Functions
The statistics module provides functions for calculating mathematical statistics of numeric (Real-valued) data. This module is essential for data analysis, statistical computations, and scientific programming tasks requiring descriptive statistics.
Official Documentation: Python statistics Module
Tutorial Reference: Numeric Types
📚 Basic Usage
Simple Example
import statistics
# Basic statistical measures
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.mean(data)) # 5 - Arithmetic mean (int data with an integral result stays int)
print(statistics.median(data)) # 5 - Middle value
print(statistics.mode([1, 1, 2, 3, 3, 3, 4])) # 3 - Most common value
print(statistics.stdev(data)) # 2.7386127875258306 - Standard deviation
# Different types of means
print(statistics.fmean(data)) # 5.0 - Fast floating-point arithmetic mean
print(statistics.geometric_mean(data)) # 4.147166274396913 - Geometric mean
print(statistics.harmonic_mean(data)) # 3.181371861411137 - Harmonic mean
Core Functions
import statistics
# Sample data
scores = [85, 90, 88, 92, 87, 91, 89, 94, 86, 93]
# Central tendency measures
mean_score = statistics.mean(scores) # 89.5
median_score = statistics.median(scores) # 89.5
mode_score = statistics.mode([85, 85, 90, 90, 90]) # 90
# Spread measures
variance = statistics.variance(scores) # 9.166666666666666
std_dev = statistics.stdev(scores) # 3.027650354097491
pop_var = statistics.pvariance(scores) # 8.25
pop_std = statistics.pstdev(scores) # 2.8722813232690143
print(f"Mean: {mean_score}, Std Dev: {std_dev:.2f}")
Common Patterns
import statistics
# Pattern 1: Basic descriptive statistics
def describe_data(data):
    """Calculate basic descriptive statistics for a dataset."""
    return {
        'count': len(data),
        'mean': statistics.mean(data),
        'median': statistics.median(data),
        # stdev() needs at least two data points
        'std_dev': statistics.stdev(data) if len(data) > 1 else 0,
        'min': min(data),
        'max': max(data)
    }
# Pattern 2: Robust statistics with error handling
def safe_statistics(data):
    """Calculate statistics with error handling for edge cases."""
    if not data:
        return None
    try:
        return {
            'mean': statistics.mean(data),
            'median': statistics.median(data),
            # Report a mode only when at least one value repeats
            'mode': statistics.mode(data) if len(set(data)) < len(data) else None
        }
    except statistics.StatisticsError as e:
        print(f"Statistics error: {e}")
        return None
# Pattern 3: Comparative analysis
# correlation() and covariance() require Python 3.10+ and inputs of equal length
def compare_datasets(data1, data2):
    """Compare two datasets using various statistical measures."""
    return {
        'correlation': statistics.correlation(data1, data2),
        'covariance': statistics.covariance(data1, data2),
        'mean_diff': statistics.mean(data1) - statistics.mean(data2)
    }
🔧 Statistics API Reference
Measures of Central Tendency
| Function | Description | Return Type | Example |
|---|---|---|---|
| mean(data) | Arithmetic mean (average) | Matches input type (float for non-integral int results) | mean([1,2,3,4,5]) → 3 |
| fmean(data) | Fast floating-point arithmetic mean | float | fmean([1,2,3,4,5]) → 3.0 |
| geometric_mean(data) | Geometric mean | float | geometric_mean([2,4,8]) → ≈ 4.0 |
| harmonic_mean(data) | Harmonic mean | float | harmonic_mean([2,4,4]) → 3.0 |
| median(data) | Middle value | Input type, or float when averaging two middle values | median([1,2,3,4,5]) → 3 |
| median_low(data) | Low median of data | Same as input | median_low([1,2,3,4]) → 2 |
| median_high(data) | High median of data | Same as input | median_high([1,2,3,4]) → 3 |
| median_grouped(data, interval=1) | Median of grouped continuous data | float | median_grouped([1,2,3,4], 1) → 2.5 |
| mode(data) | Most common value | Same as input | mode([1,1,2,3]) → 1 |
| multimode(data) | List of most common values | list | multimode([1,1,2,2,3]) → [1, 2] |
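The return types in the table are easiest to see by running the functions on different numeric types. The following is a small sketch; the variable names are illustrative only.

```python
from decimal import Decimal
from fractions import Fraction
import statistics

# mean() tries to preserve the input type
int_mean = statistics.mean([1, 2, 3, 4, 5])      # int data, integral result
mixed_mean = statistics.mean([1, 2, 3, 4])       # int data, fractional result
frac_mean = statistics.mean([Fraction(1, 3), Fraction(2, 3)])
dec_mean = statistics.mean([Decimal("0.1"), Decimal("0.2"), Decimal("0.3")])

# fmean() always converts its inputs to float
float_mean = statistics.fmean([Fraction(1, 3), Fraction(2, 3)])

for name, value in [
    ("int_mean", int_mean),
    ("mixed_mean", mixed_mean),
    ("frac_mean", frac_mean),
    ("dec_mean", dec_mean),
    ("float_mean", float_mean),
]:
    print(f"{name}: {value!r} ({type(value).__name__})")
```

If exact Decimal or Fraction results matter, mean() is the safer choice; fmean() trades that exactness for speed.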
Measures of Spread
| Function | Description | Return Type | Example |
|---|---|---|---|
| pstdev(data, mu=None) | Population standard deviation | float | pstdev([1,2,3,4,5]) → 1.41... |
| pvariance(data, mu=None) | Population variance | float | pvariance([1,2,3,4,5]) → 2 |
| stdev(data, xbar=None) | Sample standard deviation | float | stdev([1,2,3,4,5]) → 1.58... |
| variance(data, xbar=None) | Sample variance | float | variance([1,2,3,4,5]) → 2.5 |
| quantiles(data, *, n=4, method='exclusive') | Divide data into equal probability intervals | list | quantiles([1,2,3,4,5], n=4) → [1.5, 3.0, 4.5] |
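The method parameter of quantiles() changes the cut points noticeably on small datasets. This sketch, assuming Python 3.8 or later (when quantiles() was added), compares the two methods on the same data.

```python
import statistics

data = [1, 2, 3, 4, 5]

# Default 'exclusive': treats the data as a sample from a larger population
q_exclusive = statistics.quantiles(data, n=4)
print(q_exclusive)  # [1.5, 3.0, 4.5]

# 'inclusive': treats the data as the whole population (min and max included)
q_inclusive = statistics.quantiles(data, n=4, method='inclusive')
print(q_inclusive)  # [2.0, 3.0, 4.0]

# Interquartile range from the exclusive quartiles
iqr = q_exclusive[2] - q_exclusive[0]
print(iqr)  # 3.0
```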
Relationships Between Variables
| Function | Description | Return Type | Example |
|---|---|---|---|
| correlation(x, y) | Pearson correlation coefficient | float | correlation([1,2,3], [1,2,3]) → 1.0 |
| covariance(x, y) | Sample covariance | float | covariance([1,2,3], [1,2,3]) → 1.0 |
| linear_regression(x, y) | Linear regression parameters | LinearRegression | linear_regression([1,2,3], [2,4,6]) |
Classes
| Class | Description | Key Methods |
|---|---|---|
| NormalDist(mu=0.0, sigma=1.0) | Normal distribution | pdf(), cdf(), inv_cdf(), samples() |
| LinearRegression | Linear regression result | slope, intercept |
| StatisticsError | Statistics-specific exception | - |
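The key NormalDist methods listed above can be sketched on a single distribution. The mean of 100 and standard deviation of 15 are arbitrary illustrative values, not taken from any dataset.

```python
import statistics

dist = statistics.NormalDist(mu=100, sigma=15)

# cdf(x): probability that a value is <= x
p_below_115 = dist.cdf(115)

# inv_cdf(p): the value below which a fraction p of the distribution lies
top_5_cutoff = dist.inv_cdf(0.95)

# pdf(x): relative likelihood (density) at x, not a probability
density = dist.pdf(100)

# samples(n, seed=...): reproducible random draws
draws = dist.samples(5, seed=42)

print(f"P(X <= 115) = {p_below_115:.4f}")
print(f"95th percentile = {top_5_cutoff:.2f}")
print(f"density at mean = {density:.4f}")
print(f"draws = {[round(d, 1) for d in draws]}")
```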
Detailed Function Examples
Central Tendency Functions
import statistics
# Different means for different contexts
data = [1, 2, 4, 8, 16]
# Arithmetic mean - good for additive data
arithmetic = statistics.mean(data) # 6.2
# Geometric mean - good for multiplicative data (growth rates)
geometric = statistics.geometric_mean(data) # 4.0
# Harmonic mean - good for rates and ratios
rates = [10, 20, 30] # km/h
harmonic = statistics.harmonic_mean(rates) # 16.36... km/h average
# Fast mean for large floating-point datasets
large_data = list(range(10000))
fast_mean = statistics.fmean(large_data) # Optimized for speed
Median Functions
import statistics
# Various median calculations
data = [1, 2, 3, 4, 5, 6]
median = statistics.median(data) # 3.5 (average of 3 and 4)
median_low = statistics.median_low(data) # 3 (lower middle value)
median_high = statistics.median_high(data) # 4 (higher middle value)
# Grouped median for continuous data
grouped = statistics.median_grouped([1, 2, 3, 4, 5, 6], interval=2)
Mode Functions
import statistics
# Single mode
single_mode_data = [1, 1, 2, 3, 3, 3, 4]
mode = statistics.mode(single_mode_data) # 3
# Multiple modes
multi_mode_data = [1, 1, 2, 2, 3]
modes = statistics.multimode(multi_mode_data) # [1, 2]
# Mode with strings
text_data = ['apple', 'banana', 'apple', 'cherry', 'apple']
text_mode = statistics.mode(text_data) # 'apple'
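When several values tie for most common, mode() and multimode() behave differently. The sketch below assumes Python 3.8 or later, where mode() returns the first mode encountered rather than raising.

```python
import statistics

tied = [1, 1, 2, 2, 3]

# Python 3.8+: returns the first of the tied values in input order
first_mode = statistics.mode(tied)
print(first_mode)  # 1

# multimode() returns every tied value, in order of first appearance
all_modes = statistics.multimode(tied)
print(all_modes)  # [1, 2]

# multimode() returns an empty list for empty input instead of raising
empty_modes = statistics.multimode([])
print(empty_modes)  # []
```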
Important Notes
- Most functions accept any iterable of numeric values
- fmean() converts every input to float, so it is faster than mean() but loses the exactness of Decimal and Fraction
- Population functions (pstdev, pvariance) use n as denominator; sample functions use n-1
- mode() returns the first mode encountered when several values tie (Python 3.8+; earlier versions raised StatisticsError); use multimode() to get all modes
- Data can be any numeric type: int, float, Decimal, Fraction
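The n versus n-1 distinction can be checked by hand. This is a minimal sketch with made-up data that recomputes both variances from the sum of squared deviations and compares them with the library results.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = statistics.fmean(data)

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in data)

manual_population = ss / n        # denominator n
manual_sample = ss / (n - 1)      # denominator n - 1 (Bessel's correction)

print(math.isclose(manual_population, statistics.pvariance(data)))  # True
print(math.isclose(manual_sample, statistics.variance(data)))       # True
print(statistics.pstdev(data))  # 2.0
```

Use the sample functions when the data is a subset drawn from a larger population, and the population functions when the data is the entire population.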
🐛 Common Errors and Troubleshooting
Typical Error Messages
import statistics
# Error 1: StatisticsError - No unique mode (Python 3.7 and earlier only)
# Since Python 3.8, mode() returns the first mode encountered instead of raising
try:
    print(statistics.mode([1, 2, 3, 4]))  # 1 on Python 3.8+
except statistics.StatisticsError as e:
    print(f"Mode error: {e}")  # Python 3.7 and earlier: "no unique mode; found 4 equally common values"
# Error 2: StatisticsError - Empty dataset
try:
    statistics.mean([])
except statistics.StatisticsError as e:
    print(f"Mean error: {e}")  # "mean requires at least one data point"
# Error 3: TypeError - Invalid data types
try:
    statistics.mean(['a', 'b', 'c'])
except TypeError as e:
    print(f"Type error: {e}")  # Strings cannot be converted to numbers
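Two further StatisticsError cases come up often: sample spread functions need at least two data points, and harmonic_mean() rejects negative values. The helper name below is hypothetical, and the exact message text varies between Python versions, so the sketch avoids matching on it.

```python
import statistics

def sample_stdev_or_none(data):
    """Hypothetical helper: sample standard deviation, or None when undefined."""
    try:
        return statistics.stdev(data)
    except statistics.StatisticsError:
        # Raised when fewer than two data points are supplied
        return None

print(sample_stdev_or_none([5]))        # None
print(sample_stdev_or_none([1, 2, 3]))  # 1.0

# harmonic_mean() raises StatisticsError for negative inputs
try:
    statistics.harmonic_mean([1, -2, 3])
    negative_rejected = False
except statistics.StatisticsError:
    negative_rejected = True
print(negative_rejected)  # True
```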
Debugging Tips
import statistics
def safe_mean(data):
    """Calculate mean with comprehensive error handling."""
    if not data:
        print("Warning: Empty dataset")
        return None
    try:
        # Check for numeric data
        numeric_data = [float(x) for x in data]
        return statistics.mean(numeric_data)
    except (TypeError, ValueError) as e:
        print(f"Data conversion error: {e}")
        return None
    except statistics.StatisticsError as e:
        print(f"Statistics error: {e}")
        return None
# Usage
result = safe_mean([1, 2, 3, '4', 5]) # Handles string numbers
Error Handling Patterns
import statistics
def robust_statistics(data):
    """Calculate statistics with proper error handling."""
    results = {}
    if not data:
        return {"error": "Empty dataset"}
    try:
        results['mean'] = statistics.mean(data)
        results['median'] = statistics.median(data)
        # Handle mode specially: mode() no longer raises on ties (Python 3.8+),
        # so use multimode() to detect them
        modes = statistics.multimode(data)
        if len(modes) == 1:
            results['mode'] = modes[0]
        else:
            results['modes'] = modes
        if len(data) > 1:
            results['stdev'] = statistics.stdev(data)
    except (TypeError, ValueError) as e:
        results['error'] = f"Data type error: {e}"
    except statistics.StatisticsError as e:
        results['error'] = f"Statistics error: {e}"
    return results
🎯 Primary Use Cases
1. Data Analysis and Exploration
Use Case: Analyzing experimental data or survey results for initial insights
Why statistics: Provides comprehensive descriptive statistics in a simple, standardized way
Code Example:
import statistics
# Survey response analysis
survey_scores = [7, 8, 6, 9, 7, 8, 5, 9, 8, 7, 6, 9, 8, 7]
def analyze_survey(scores):
    """Comprehensive survey analysis."""
    analysis = {
        'sample_size': len(scores),
        'mean_score': statistics.mean(scores),
        'median_score': statistics.median(scores),
        'std_deviation': statistics.stdev(scores),
        'score_range': max(scores) - min(scores)
    }
    # Calculate quartiles
    quartiles = statistics.quantiles(scores, n=4)
    analysis['q1'], analysis['q3'] = quartiles[0], quartiles[2]
    analysis['iqr'] = analysis['q3'] - analysis['q1']
    return analysis
results = analyze_survey(survey_scores)
print(f"Mean satisfaction: {results['mean_score']:.2f}")
print(f"Standard deviation: {results['std_deviation']:.2f}")
2. Quality Control and Process Monitoring
Use Case: Monitoring manufacturing processes or service quality metrics
Why statistics: mean() and stdev() supply the center line and 3-sigma control limits that control charts are built from
Code Example:
import statistics
class QualityMonitor:
    """Monitor process quality using statistical control."""
    def __init__(self, historical_data):
        self.mean = statistics.mean(historical_data)
        self.std_dev = statistics.stdev(historical_data)
        # Control limits at three standard deviations from the mean
        self.upper_limit = self.mean + 3 * self.std_dev
        self.lower_limit = self.mean - 3 * self.std_dev
    def check_process(self, new_measurements):
        """Check if process is in statistical control."""
        out_of_control = []
        for i, measurement in enumerate(new_measurements):
            if measurement > self.upper_limit or measurement < self.lower_limit:
                out_of_control.append((i, measurement))
        current_mean = statistics.mean(new_measurements)
        # Drift of the batch mean, in units of historical standard deviations
        drift = abs(current_mean - self.mean) / self.std_dev
        return {
            'in_control': len(out_of_control) == 0,
            'outliers': out_of_control,
            'process_drift': drift,
            'current_mean': current_mean
        }
# Usage
historical = [10.2, 10.1, 9.9, 10.3, 10.0, 9.8, 10.1, 10.2]
monitor = QualityMonitor(historical)
new_batch = [10.1, 10.5, 9.7, 10.2, 11.0] # One outlier
status = monitor.check_process(new_batch)
3. A/B Testing and Experimental Analysis
Use Case: Comparing the performance of different versions or treatments
Why statistics: Mean, median, and standard deviation functions for summarizing each group and estimating effect size
Code Example:
import statistics
def ab_test_analysis(control_group, treatment_group):
    """Analyze A/B test results with statistical measures."""
    # Basic descriptive statistics
    control_stats = {
        'mean': statistics.mean(control_group),
        'std': statistics.stdev(control_group),
        'median': statistics.median(control_group)
    }
    treatment_stats = {
        'mean': statistics.mean(treatment_group),
        'std': statistics.stdev(treatment_group),
        'median': statistics.median(treatment_group)
    }
    # Calculate effect size (Cohen's d approximation)
    pooled_var = ((len(control_group) - 1) * control_stats['std']**2 +
                  (len(treatment_group) - 1) * treatment_stats['std']**2) / \
                 (len(control_group) + len(treatment_group) - 2)
    pooled_std = pooled_var ** 0.5
    effect_size = (treatment_stats['mean'] - control_stats['mean']) / pooled_std
    return {
        'control': control_stats,
        'treatment': treatment_stats,
        'lift': (treatment_stats['mean'] - control_stats['mean']) / control_stats['mean'],
        'effect_size': effect_size
    }
# Example: Website conversion rates (%)
control_conversions = [2.1, 2.3, 1.9, 2.4, 2.0, 2.2, 1.8, 2.5]
treatment_conversions = [2.8, 3.1, 2.9, 3.0, 2.7, 3.2, 2.6, 3.3]
results = ab_test_analysis(control_conversions, treatment_conversions)
print(f"Conversion lift: {results['lift']:.1%}")
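For reference, the effect-size calculation in the example corresponds to the usual pooled-standard-deviation form of Cohen's d. The symbols are notational assumptions: n_c and n_t are the group sizes, s_c and s_t the sample standard deviations, and x̄_c and x̄_t the group means for control and treatment.

```latex
s_p = \sqrt{\frac{(n_c - 1)\,s_c^2 + (n_t - 1)\,s_t^2}{n_c + n_t - 2}},
\qquad
d = \frac{\bar{x}_t - \bar{x}_c}{s_p}
```

Note that the statistics module reports the effect size only; it provides no significance test, so a p-value would need SciPy or a manual calculation.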
4. Normal Distribution Analysis and Modeling
Use Case: Modeling data that follows normal distribution patterns
Why statistics: Built-in NormalDist class for probability calculations
Code Example:
import statistics
# Create normal distribution from sample data
heights = [165, 170, 175, 168, 172, 169, 174, 171, 167, 173]
height_dist = statistics.NormalDist.from_samples(heights)
print(f"Mean height: {height_dist.mean:.1f} cm")
print(f"Standard deviation: {height_dist.stdev:.1f} cm")
# Probability calculations
prob_tall = 1 - height_dist.cdf(180) # P(height > 180)
prob_range = height_dist.cdf(175) - height_dist.cdf(165) # P(165 < height < 175)
print(f"Probability of height > 180cm: {prob_tall:.1%}")
print(f"Probability of height 165-175cm: {prob_range:.1%}")
# Generate samples from the distribution
samples = height_dist.samples(100) # Generate 100 random samples
# Compare two normal distributions
male_heights = statistics.NormalDist(175, 8)
female_heights = statistics.NormalDist(162, 7)
# Overlap between distributions
overlap = male_heights.overlap(female_heights)
print(f"Distribution overlap: {overlap:.1%}")
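Besides overlap(), two NormalDist instances can be combined arithmetically, which gives another way to compare them. The sketch below reuses the illustrative height parameters from the example and assumes the two variables are independent; subtracting the distributions yields the distribution of the difference.

```python
import statistics

male_heights = statistics.NormalDist(175, 8)
female_heights = statistics.NormalDist(162, 7)

# Distribution of (male height - female height) for independent draws:
# means subtract, variances add
difference = male_heights - female_heights

# Probability that a randomly chosen male is taller than a randomly chosen female
p_male_taller = 1 - difference.cdf(0)

print(f"Mean difference: {difference.mean:.1f} cm")
print(f"Std dev of difference: {difference.stdev:.2f} cm")
print(f"P(male taller): {p_male_taller:.1%}")
```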
Performance Considerations
Time Complexity Summary
| Operation | Time Complexity | Notes |
|---|---|---|
| mean(), fmean() | O(n) | Single pass through data |
| median() | O(n log n) | Requires sorting |
| mode() | O(n) | Single pass with counting |
| stdev(), variance() | O(n) | Two passes: mean calculation + variance |
| quantiles() | O(n log n) | Requires sorting |
| correlation() | O(n) | Three passes through data |
Basic Benchmarking
import statistics
import timeit
# Performance comparison for mean calculations
data = list(range(100000))
# Time different mean functions
mean_time = timeit.timeit(lambda: statistics.mean(data), number=100)
fmean_time = timeit.timeit(lambda: statistics.fmean(data), number=100)
builtin_sum_time = timeit.timeit(lambda: sum(data) / len(data), number=100)
print(f"statistics.mean(): {mean_time:.4f}s")
print(f"statistics.fmean(): {fmean_time:.4f}s") # Typically much faster than mean()
print(f"sum()/len(): {builtin_sum_time:.4f}s")
# Streaming-friendly mean (the median still needs every value)
def memory_efficient_stats(data_generator):
    """Accumulate the mean in one pass; the median still requires storing all values."""
    total, count, values = 0, 0, []
    for value in data_generator():
        total += value
        count += 1
        values.append(value)  # Stored only because median() needs the full dataset
    if count == 0:
        return None  # Avoid dividing by zero on an empty stream
    return {
        'mean': total / count,
        'median': statistics.median(values),
        'count': count
    }
Memory Usage Tips
- Use fmean() instead of mean() for large floating-point datasets
- For streaming data, calculate mean incrementally: total/count
- median() and quantiles() both sort a full copy of the data, so neither suits datasets that do not fit in memory
- NormalDist uses minimal memory for probability calculations vs storing large samples
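For streaming data, the incremental idea extends from the mean to the variance. The sketch below uses Welford's online algorithm, which is not part of the statistics module; the RunningStats class and its attribute names are hypothetical.

```python
import math
import statistics

class RunningStats:
    """Hypothetical helper: Welford's online algorithm for mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new value into the running statistics in O(1) memory."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator), matching statistics.variance()."""
        if self.count < 2:
            raise ValueError("variance requires at least two data points")
        return self._m2 / (self.count - 1)

    @property
    def stdev(self):
        return math.sqrt(self.variance)

# Consume a generator without ever materializing it as a list
running = RunningStats()
for value in (x * 0.5 for x in range(1, 1001)):
    running.update(value)

print(f"count={running.count}, mean={running.mean:.2f}, stdev={running.stdev:.2f}")
```

This keeps only three numbers in memory regardless of stream length. Median and other quantiles cannot be computed exactly this way and need either the full data or an approximate method.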
🎯 When to Use statistics
✅ Ideal Use Cases
- Quick descriptive statistics: Built-in functions for common statistical measures
- Data exploration: Initial analysis of datasets for insights and patterns
- Quality control: Process monitoring and control chart calculations
- A/B testing: Comparing groups with correlation and mean comparisons
- Normal distribution modeling: Built-in NormalDist class for probability calculations
- Educational purposes: Teaching statistical concepts with clear, readable code
- Scientific computing: When you need accurate statistical calculations
- Small to medium datasets: Efficient for datasets that fit in memory
❌ When NOT to Use statistics
- Large-scale data analysis: Use NumPy/Pandas for better performance with large arrays
- Advanced statistical tests: Use SciPy for hypothesis testing, regression analysis
- Machine learning: Use scikit-learn for predictive modeling and advanced algorithms
- Time series analysis: Use pandas or specialized libraries like statsmodels
- Streaming data: Limited support for online/incremental calculations
- Multi-dimensional data: No built-in support for matrices or multi-dimensional arrays
- Complex probability distributions: Limited to normal distribution only
Alternative Solutions
- NumPy: High-performance arrays and mathematical functions
- Pandas: Data manipulation with built-in statistical methods
- SciPy: Advanced statistical functions and hypothesis testing
- scikit-learn: Machine learning algorithms and advanced statistics
- statsmodels: Econometric and statistical modeling
- Custom implementations: For specific requirements or streaming data
Additional Learning Resources
Official Python Resources (PRIMARY SOURCES)
- Library Documentation: statistics module - Complete function reference and examples
- Tutorial: Python Tutorial - Numbers - Basic numeric operations
- PEP 450: Adding a statistics module to the Python standard library - Design rationale and implementation details
- What's New: Python 3.4+ - Statistics module introduction and updates
Books and Publications
- "Think Stats" by Allen B. Downey - Statistics and probability for programmers
- "Statistics for Hackers" by Jake VanderPlas - Modern statistical analysis with Python
- "Python for Data Analysis" by Wes McKinney - Comprehensive data analysis with pandas and NumPy
- "Introduction to Statistical Learning with Python" - Statistical learning concepts with Python implementations
Online Tutorials and Courses
- Real Python: Python statistics module tutorial - Comprehensive guide with examples
- DataCamp: Python statistics courses - Interactive lessons with hands-on practice
- Coursera: "Statistics with Python" specialization - University-level statistical analysis
- Khan Academy: Statistics and probability - Foundation concepts with visual explanations
Practice and Examples
- Kaggle Learn: Statistics course - Free micro-courses with real-world datasets
- HackerRank: Statistics challenges - Programming problems involving statistical calculations
- LeetCode: Math and statistics problems - Coding interview preparation
- GitHub: Awesome Statistics - Curated list of statistical resources
Advanced Topics
- SciPy Documentation: Statistical functions - Advanced statistical tests and distributions
- NumPy Documentation: Mathematical functions - Array-based mathematical operations
- Pandas Documentation: Computational tools - DataFrame statistical methods
- Matplotlib/Seaborn: Data visualization for statistical analysis
Community Resources
- r/statistics: Reddit community for statistical discussions and help
- Cross Validated: Stack Exchange site for statistics questions
- Python.org: SIG for Scientific Computing - Special interest group discussions
- PyData: Global community for Python data science practitioners