statistics - Statistical Functions

The statistics module provides functions for calculating mathematical statistics of numeric (Real-valued) data. It ships with the standard library and covers the descriptive statistics needed in data analysis and scientific scripts without a third-party dependency.

Official Documentation: Python statistics Module
Tutorial Reference: Numeric Types

📚 Basic Usage

Simple Example

import statistics

# Basic statistical measures
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(statistics.mean(data)) # 5 - Arithmetic mean (an int here: the inputs are ints and the result is exact)
print(statistics.median(data)) # 5 - Middle value
print(statistics.mode([1, 1, 2, 3, 3, 3, 4])) # 3 - Most common value
print(statistics.stdev(data)) # 2.7386127875258306 - Standard deviation

# Different types of means
print(statistics.fmean(data)) # 5.0 - Fast floating-point arithmetic mean
print(statistics.geometric_mean(data)) # 4.147166274396913 - Geometric mean
print(statistics.harmonic_mean(data)) # 3.181371861411137 - Harmonic mean

Core Functions

import statistics

# Sample data
scores = [85, 90, 88, 92, 87, 91, 89, 94, 86, 93]

# Central tendency measures
mean_score = statistics.mean(scores) # 89.5
median_score = statistics.median(scores) # 89.5
mode_score = statistics.mode([85, 85, 90, 90, 90]) # 90

# Spread measures
variance = statistics.variance(scores) # 9.166666666666666
std_dev = statistics.stdev(scores) # 3.027650354097491
pop_var = statistics.pvariance(scores) # 8.25
pop_std = statistics.pstdev(scores) # 2.8722813232690143

print(f"Mean: {mean_score}, Std Dev: {std_dev:.2f}")
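
The sample and population variants above differ only in the divisor (n - 1 versus n). The following is a minimal sketch of that relationship, reusing the same scores; the variable names are illustrative, and passing a precomputed mean through xbar is optional.

```python
import math
import statistics

scores = [85, 90, 88, 92, 87, 91, 89, 94, 86, 93]
n = len(scores)

sample_var = statistics.variance(scores)   # divides by n - 1
pop_var = statistics.pvariance(scores)     # divides by n

# Rescaling the sample variance by (n - 1) / n recovers the population variance.
rescaled = sample_var * (n - 1) / n
print(math.isclose(rescaled, pop_var))     # True

# A mean that is already known can be passed in so it is not recomputed.
known_mean = statistics.mean(scores)
same_var = statistics.variance(scores, xbar=known_mean)
print(math.isclose(same_var, sample_var))  # True
```

Use the sample functions when the data is a subset drawn from a larger population, and the population functions when the data is the whole population.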

Common Patterns

import statistics

# Pattern 1: Basic descriptive statistics
def describe_data(data):
    """Calculate basic descriptive statistics for a dataset."""
    return {
        'count': len(data),
        'mean': statistics.mean(data),
        'median': statistics.median(data),
        'std_dev': statistics.stdev(data) if len(data) > 1 else 0,
        'min': min(data),
        'max': max(data)
    }

# Pattern 2: Robust statistics with error handling
def safe_statistics(data):
    """Calculate statistics with error handling for edge cases."""
    if not data:
        return None

    try:
        return {
            'mean': statistics.mean(data),
            'median': statistics.median(data),
            # Report a mode only when at least one value repeats
            'mode': statistics.mode(data) if len(set(data)) < len(data) else None
        }
    except statistics.StatisticsError as e:
        print(f"Statistics error: {e}")
        return None

# Pattern 3: Comparative analysis
# correlation() and covariance() require Python 3.10+ and equal-length inputs
def compare_datasets(data1, data2):
    """Compare two datasets using various statistical measures."""
    return {
        'correlation': statistics.correlation(data1, data2),
        'covariance': statistics.covariance(data1, data2),
        'mean_diff': statistics.mean(data1) - statistics.mean(data2)
    }

🔧 Statistics API Reference

Measures of Central Tendency

| Function | Description | Return Type | Example |
| --- | --- | --- | --- |
| mean(data) | Arithmetic mean (average) | Depends on input type | mean([1,2,3,4,5]) → 3 |
| fmean(data) | Fast floating-point arithmetic mean | float | fmean([1,2,3,4,5]) → 3.0 |
| geometric_mean(data) | Geometric mean | float | geometric_mean([2,4,8]) → approximately 4.0 |
| harmonic_mean(data) | Harmonic mean | float | harmonic_mean([2,4,4]) → 3.0 |
| median(data) | Middle value (average of the two middle values for even-length data) | Data value or average | median([1,2,3,4,5]) → 3 |
| median_low(data) | Low median of data | Same as input | median_low([1,2,3,4]) → 2 |
| median_high(data) | High median of data | Same as input | median_high([1,2,3,4]) → 3 |
| median_grouped(data, interval=1) | Median of grouped continuous data | float | median_grouped([1,2,3,4], 1) → 2.5 |
| mode(data) | Most common value | Same as input | mode([1,1,2,3]) → 1 |
| multimode(data) | List of most common values | list | multimode([1,1,2,2,3]) → [1, 2] |

Measures of Spread

| Function | Description | Return Type | Example |
| --- | --- | --- | --- |
| pstdev(data, mu=None) | Population standard deviation | float | pstdev([1,2,3,4,5]) → 1.41... |
| pvariance(data, mu=None) | Population variance | Depends on input type | pvariance([1,2,3,4,5]) → 2 |
| stdev(data, xbar=None) | Sample standard deviation | float | stdev([1,2,3,4,5]) → 1.58... |
| variance(data, xbar=None) | Sample variance | Depends on input type | variance([1,2,3,4,5]) → 2.5 |
| quantiles(data, *, n=4, method='exclusive') | Cut points dividing data into n equal-probability intervals | list of n - 1 values | quantiles([1,2,3,4,5], n=4) → [1.5, 3.0, 4.5] |
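
The method parameter of quantiles() changes how the cut points are interpolated. The short sketch below compares the two documented methods on a small list; the variable names are illustrative only.

```python
import statistics

data = [1, 2, 3, 4, 5]

# Default method="exclusive": assumes the population may contain values
# more extreme than those found in the sample.
exclusive = statistics.quantiles(data, n=4)

# method="inclusive": treats the minimum and maximum of the data as the
# extremes of the population.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [1.5, 3.0, 4.5]
print(inclusive)  # [2.0, 3.0, 4.0]

# The interquartile range therefore depends on the method chosen.
iqr_exclusive = exclusive[2] - exclusive[0]
iqr_inclusive = inclusive[2] - inclusive[0]
```

With n=4 the result holds the three quartile cut points; n=10 would give nine decile cut points.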

Relationships Between Variables

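| Function | Description | Return Type | Example |
| --- | --- | --- | --- |
| correlation(x, y) | Pearson correlation coefficient (Python 3.10+) | float | correlation([1,2,3], [1,2,3]) → 1.0 |
| covariance(x, y) | Sample covariance (Python 3.10+) | float | covariance([1,2,3], [1,2,3]) → 1.0 |
| linear_regression(x, y) | Slope and intercept of a simple least-squares fit (Python 3.10+) | LinearRegression | linear_regression([1,2,3], [2,4,6]) → slope 2.0, intercept 0.0 |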

Classes

| Class | Description | Key Methods |
| --- | --- | --- |
| NormalDist(mu=0.0, sigma=1.0) | Normal distribution | pdf(), cdf(), inv_cdf(), samples() |
| LinearRegression | Linear regression result | slope, intercept |
| StatisticsError | Statistics-specific exception (subclass of ValueError) | - |
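
A brief sketch of the NormalDist methods listed above, using the standard normal distribution. The variable names and the seed value are illustrative.

```python
import statistics

z = statistics.NormalDist()            # standard normal: mu=0.0, sigma=1.0

density_at_zero = z.pdf(0)             # height of the bell curve at 0
below_zero = z.cdf(0)                  # P(X <= 0), which is 0.5 by symmetry
upper_cut = z.inv_cdf(0.975)           # value with 97.5% of the mass below it

print(f"pdf(0) = {density_at_zero:.4f}")       # 0.3989
print(f"cdf(0) = {below_zero:.1f}")            # 0.5
print(f"inv_cdf(0.975) = {upper_cut:.2f}")     # 1.96

# samples() draws random values; a seed makes the draw repeatable
draws = z.samples(5, seed=42)
print(len(draws))                              # 5
```

inv_cdf() is the inverse of cdf(), so it is the natural tool for turning a confidence level into a cutoff value.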

Detailed Function Examples

Central Tendency Functions

import statistics

# Different means for different contexts
data = [1, 2, 4, 8, 16]

# Arithmetic mean - good for additive data
arithmetic = statistics.mean(data) # 6.2

# Geometric mean - good for multiplicative data (growth rates)
geometric = statistics.geometric_mean(data) # 4.0

# Harmonic mean - good for rates and ratios
rates = [10, 20, 30] # km/h
harmonic = statistics.harmonic_mean(rates) # 16.36... km/h average

# Fast mean for large floating-point datasets
large_data = list(range(10000))
fast_mean = statistics.fmean(large_data) # Optimized for speed

Median Functions

import statistics

# Various median calculations
data = [1, 2, 3, 4, 5, 6]

median = statistics.median(data) # 3.5 (average of 3 and 4)
median_low = statistics.median_low(data) # 3 (lower middle value)
median_high = statistics.median_high(data) # 4 (higher middle value)

# Grouped median for continuous data
grouped = statistics.median_grouped([1, 2, 3, 4, 5, 6], interval=2)
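
median_grouped() treats each value as the midpoint of a class of width interval and interpolates inside the class that contains the median. The sketch below recomputes that interpolation by hand for comparison; grouped_median_by_hand() is a hypothetical helper written for this illustration, assuming the formula L + interval * (n/2 - cf) / f, where L is the lower class limit, cf the count of values below the median class, and f the count inside it.

```python
import math
import statistics
from bisect import bisect_left, bisect_right

def grouped_median_by_hand(values, interval=1):
    """Hypothetical helper mirroring the grouped-median interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    x = ordered[n // 2]                  # midpoint of the median class
    lower = x - interval / 2             # L: lower limit of that class
    cf = bisect_left(ordered, x)         # values below the median class
    f = bisect_right(ordered, x) - cf    # values inside the median class
    return lower + interval * (n / 2 - cf) / f

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]

by_hand = grouped_median_by_hand(data)
from_module = statistics.median_grouped(data)

print(by_hand, from_module)              # both approximately 3.7
print(math.isclose(by_hand, from_module))
```

For the six-value list above with interval=2, the same formula gives 3.0.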

Mode Functions

import statistics

# Single mode
single_mode_data = [1, 1, 2, 3, 3, 3, 4]
mode = statistics.mode(single_mode_data) # 3

# Multiple modes
multi_mode_data = [1, 1, 2, 2, 3]
modes = statistics.multimode(multi_mode_data) # [1, 2]

# Mode with strings
text_data = ['apple', 'banana', 'apple', 'cherry', 'apple']
text_mode = statistics.mode(text_data) # 'apple'

Important Notes

  • Most functions accept any iterable of numeric values
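  • fmean() converts every value to float and always returns a float, so it is faster than mean() but loses the exactness of Decimal and Fraction inputs
  • Population functions (pstdev, pvariance) use n as denominator; sample functions (stdev, variance) use n-1
  • mode() returns the first mode encountered when several values tie (Python 3.8+; earlier versions raised StatisticsError) and raises StatisticsError only for empty data; use multimode() to get every mode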
  • Data can be any numeric type: int, float, Decimal, Fraction
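
Two of these notes are easy to check directly. The sketch below assumes Python 3.8 or later; the sample values are made up for illustration.

```python
import statistics
from fractions import Fraction

# Ties: mode() returns the first of the tied values, multimode() returns all
tied = [1, 1, 2, 2, 3]
first_mode = statistics.mode(tied)
all_modes = statistics.multimode(tied)
print(first_mode, all_modes)      # 1 [1, 2]

# Exactness: mean() keeps Fraction arithmetic exact, fmean() converts to float
thirds = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
exact = statistics.mean(thirds)
approx = statistics.fmean(thirds)
print(repr(exact))                # Fraction(1, 3)
print(type(approx).__name__)      # float
```

If exact rational or decimal results matter, prefer mean(); if only speed matters and float precision is acceptable, prefer fmean().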

🐛 Common Errors and Troubleshooting

Typical Error Messages

import statistics

# Error 1: StatisticsError - mode() of empty data
# Since Python 3.8, ties no longer raise: mode([1, 2, 3, 4]) returns 1,
# the first value encountered. Earlier versions raised "no unique mode".
try:
    statistics.mode([])
except statistics.StatisticsError as e:
    print(f"Mode error: {e}") # "no mode for empty data"

# Error 2: StatisticsError - Empty dataset
try:
    statistics.mean([])
except statistics.StatisticsError as e:
    print(f"Mean error: {e}") # "mean requires at least one data point"

# Error 3: TypeError - Invalid data types
try:
    statistics.mean(['a', 'b', 'c'])
except TypeError as e:
    print(f"Type error: {e}") # Strings cannot be converted to a numeric ratio
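
Another frequent source of StatisticsError is asking for a sample spread from fewer than two data points. The sketch below shows one way to guard against it; spread_or_none() is a hypothetical helper, not part of the module.

```python
import statistics

def spread_or_none(values):
    """Hypothetical helper: sample standard deviation, or None if undefined."""
    try:
        return statistics.stdev(values)
    except statistics.StatisticsError:
        # stdev() and variance() need at least two data points
        return None

print(spread_or_none([42]))        # None
print(spread_or_none([40, 44]))    # about 2.83

# StatisticsError subclasses ValueError, so a broad "except ValueError"
# clause catches it as well.
print(issubclass(statistics.StatisticsError, ValueError))  # True
```

Because of that subclass relationship, an except clause for StatisticsError must come before one for ValueError, or it will never run.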

Debugging Tips

import statistics

def safe_mean(data):
    """Calculate mean with comprehensive error handling."""
    if not data:
        print("Warning: Empty dataset")
        return None

    try:
        # Check for numeric data
        numeric_data = [float(x) for x in data]
        return statistics.mean(numeric_data)
    except statistics.StatisticsError as e:
        # Must come first: StatisticsError is a subclass of ValueError
        print(f"Statistics error: {e}")
        return None
    except (TypeError, ValueError) as e:
        print(f"Data conversion error: {e}")
        return None

# Usage
result = safe_mean([1, 2, 3, '4', 5]) # Handles string numbers

Error Handling Patterns

import statistics

def robust_statistics(data):
    """Calculate statistics with proper error handling."""
    results = {}

    if not data:
        return {"error": "Empty dataset"}

    try:
        results['mean'] = statistics.mean(data)
        results['median'] = statistics.median(data)

        # Since Python 3.8, mode() no longer raises on ties, so report a
        # single mode only when multimode() finds exactly one
        modes = statistics.multimode(data)
        if len(modes) == 1:
            results['mode'] = modes[0]
        else:
            results['modes'] = modes

        if len(data) > 1:
            results['stdev'] = statistics.stdev(data)

    except statistics.StatisticsError as e:
        # Must come first: StatisticsError is a subclass of ValueError
        results['error'] = f"Statistics error: {e}"
    except (TypeError, ValueError) as e:
        results['error'] = f"Data type error: {e}"

    return results

🎯 Primary Use Cases

1. Data Analysis and Exploration

Use Case: Analyzing experimental data or survey results for initial insights
Why statistics: Provides comprehensive descriptive statistics in a simple, standardized way
Code Example:

import statistics

# Survey response analysis
survey_scores = [7, 8, 6, 9, 7, 8, 5, 9, 8, 7, 6, 9, 8, 7]

def analyze_survey(scores):
    """Comprehensive survey analysis."""
    analysis = {
        'sample_size': len(scores),
        'mean_score': statistics.mean(scores),
        'median_score': statistics.median(scores),
        'std_deviation': statistics.stdev(scores),
        'score_range': max(scores) - min(scores)
    }

    # Calculate quartiles
    quartiles = statistics.quantiles(scores, n=4)
    analysis['q1'], analysis['q3'] = quartiles[0], quartiles[2]
    analysis['iqr'] = analysis['q3'] - analysis['q1']

    return analysis

results = analyze_survey(survey_scores)
print(f"Mean satisfaction: {results['mean_score']:.2f}")
print(f"Standard deviation: {results['std_deviation']:.2f}")

2. Quality Control and Process Monitoring

Use Case: Monitoring manufacturing processes or service quality metrics
Why statistics: mean() and stdev() supply the center line and the 3-sigma control limits of a basic control chart
Code Example:

import statistics

class QualityMonitor:
    """Monitor process quality using statistical control."""

    def __init__(self, historical_data):
        self.mean = statistics.mean(historical_data)
        self.std_dev = statistics.stdev(historical_data)
        self.upper_limit = self.mean + 3 * self.std_dev
        self.lower_limit = self.mean - 3 * self.std_dev

    def check_process(self, new_measurements):
        """Check if process is in statistical control."""
        out_of_control = []

        for i, measurement in enumerate(new_measurements):
            if measurement > self.upper_limit or measurement < self.lower_limit:
                out_of_control.append((i, measurement))

        current_mean = statistics.mean(new_measurements)
        # Drift of the batch mean, in units of the historical standard deviation
        drift = abs(current_mean - self.mean) / self.std_dev

        return {
            'in_control': len(out_of_control) == 0,
            'outliers': out_of_control,
            'process_drift': drift,
            'current_mean': current_mean
        }

# Usage
historical = [10.2, 10.1, 9.9, 10.3, 10.0, 9.8, 10.1, 10.2]
monitor = QualityMonitor(historical)

new_batch = [10.1, 10.5, 9.7, 10.2, 11.0] # One outlier
status = monitor.check_process(new_batch)

3. A/B Testing and Experimental Analysis

Use Case: Comparing the performance of different versions or treatments
Why statistics: mean(), stdev(), and median() summarize each group, which is enough to compute lift and an effect size (the module has no significance tests)
Code Example:

import statistics

def ab_test_analysis(control_group, treatment_group):
    """Analyze A/B test results with statistical measures."""

    # Basic descriptive statistics
    control_stats = {
        'mean': statistics.mean(control_group),
        'std': statistics.stdev(control_group),
        'median': statistics.median(control_group)
    }

    treatment_stats = {
        'mean': statistics.mean(treatment_group),
        'std': statistics.stdev(treatment_group),
        'median': statistics.median(treatment_group)
    }

    # Calculate effect size (Cohen's d with a pooled standard deviation)
    n_control, n_treatment = len(control_group), len(treatment_group)
    pooled_var = ((n_control - 1) * control_stats['std'] ** 2 +
                  (n_treatment - 1) * treatment_stats['std'] ** 2) / (n_control + n_treatment - 2)
    pooled_std = pooled_var ** 0.5

    effect_size = (treatment_stats['mean'] - control_stats['mean']) / pooled_std

    return {
        'control': control_stats,
        'treatment': treatment_stats,
        'lift': (treatment_stats['mean'] - control_stats['mean']) / control_stats['mean'],
        'effect_size': effect_size
    }

# Example: Website conversion rates (%)
control_conversions = [2.1, 2.3, 1.9, 2.4, 2.0, 2.2, 1.8, 2.5]
treatment_conversions = [2.8, 3.1, 2.9, 3.0, 2.7, 3.2, 2.6, 3.3]

results = ab_test_analysis(control_conversions, treatment_conversions)
print(f"Conversion lift: {results['lift']:.1%}")

4. Normal Distribution Analysis and Modeling

Use Case: Modeling data that follows normal distribution patterns
Why statistics: Built-in NormalDist class for probability calculations
Code Example:

import statistics

# Create normal distribution from sample data
heights = [165, 170, 175, 168, 172, 169, 174, 171, 167, 173]
height_dist = statistics.NormalDist.from_samples(heights)

print(f"Mean height: {height_dist.mean:.1f} cm")
print(f"Standard deviation: {height_dist.stdev:.1f} cm")

# Probability calculations
prob_tall = 1 - height_dist.cdf(180) # P(height > 180)
prob_range = height_dist.cdf(175) - height_dist.cdf(165) # P(165 < height < 175)

print(f"Probability of height > 180cm: {prob_tall:.1%}")
print(f"Probability of height 165-175cm: {prob_range:.1%}")

# Generate samples from the distribution
samples = height_dist.samples(100) # Generate 100 random samples

# Compare two normal distributions
male_heights = statistics.NormalDist(175, 8)
female_heights = statistics.NormalDist(162, 7)

# Overlap between distributions
overlap = male_heights.overlap(female_heights)
print(f"Distribution overlap: {overlap:.1%}")

Performance Considerations

Time Complexity Summary

| Operation | Time Complexity | Notes |
| --- | --- | --- |
| mean(), fmean() | O(n) | Single pass through data |
| median() | O(n log n) | Requires sorting |
| mode() | O(n) | Single pass with counting |
| stdev(), variance() | O(n) | Mean calculation plus sum of squared deviations |
| quantiles() | O(n log n) | Requires sorting |
| correlation() | O(n) | A fixed number of passes through data |

Basic Benchmarking

import statistics
import timeit

# Performance comparison for mean calculations
data = list(range(100000))

# Time different mean functions
mean_time = timeit.timeit(lambda: statistics.mean(data), number=100)
fmean_time = timeit.timeit(lambda: statistics.fmean(data), number=100)
builtin_sum_time = timeit.timeit(lambda: sum(data) / len(data), number=100)

print(f"statistics.mean(): {mean_time:.4f}s")
print(f"statistics.fmean(): {fmean_time:.4f}s") # Usually far faster than mean(); measure on your own data
print(f"sum()/len(): {builtin_sum_time:.4f}s")

# Streaming caveat: the mean can be accumulated, the median cannot
def memory_efficient_stats(data_generator):
    """Accumulate the mean in one pass; the median still needs every value."""
    total, count, values = 0, 0, []

    for value in data_generator():
        total += value
        count += 1
        values.append(value) # median() needs all values, so memory grows with the data

    if count == 0:
        return None # Avoid ZeroDivisionError on an empty stream

    return {
        'mean': total / count,
        'median': statistics.median(values),
        'count': count
    }

Memory Usage Tips

  • Use fmean() instead of mean() for large floating-point datasets; the gain is speed rather than memory
  • For streaming data, calculate mean incrementally: total/count
  • median() and quantiles() both sort a full copy of the data, so neither suits data that does not fit in memory
  • NormalDist stores only mu and sigma, so probability calculations need far less memory than keeping large samples
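
For the streaming case, the incremental mean can be extended to variance. The sketch below uses Welford's online algorithm, which is not part of the statistics module; the RunningStats class and its method names are hypothetical, chosen for this illustration.

```python
import math
import statistics

class RunningStats:
    """Hypothetical helper: Welford's online mean and sample variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def push(self, value):
        """Fold one new value into the running totals in O(1) memory."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 divisor), matching statistics.variance()."""
        if self.count < 2:
            raise statistics.StatisticsError("variance requires at least two data points")
        return self._m2 / (self.count - 1)

readings = [10.2, 10.1, 9.9, 10.3, 10.0]   # made-up measurements
running = RunningStats()
for reading in readings:
    running.push(reading)

# The streamed results agree with the batch functions up to rounding
print(math.isclose(running.mean, statistics.mean(readings)))
print(math.isclose(running.variance, statistics.variance(readings)))
```

This keeps three numbers in memory regardless of how many values arrive, at the cost of not supporting the median or quantiles.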

🎯 When to Use statistics

✅ Ideal Use Cases

  • Quick descriptive statistics: Built-in functions for common statistical measures
  • Data exploration: Initial analysis of datasets for insights and patterns
  • Quality control: Process monitoring and control chart calculations
  • A/B testing: Comparing groups with correlation and mean comparisons
  • Normal distribution modeling: Built-in NormalDist class for probability calculations
  • Educational purposes: Teaching statistical concepts with clear, readable code
  • Scientific computing: When you need accurate statistical calculations
  • Small to medium datasets: Efficient for datasets that fit in memory

❌ When NOT to Use statistics

  • Large-scale data analysis: Use NumPy/Pandas for better performance with large arrays
  • Advanced statistical tests: Use SciPy for hypothesis testing, regression analysis
  • Machine learning: Use scikit-learn for predictive modeling and advanced algorithms
  • Time series analysis: Use pandas or specialized libraries like statsmodels
  • Streaming data: Limited support for online/incremental calculations
  • Multi-dimensional data: No built-in support for matrices or multi-dimensional arrays
  • Complex probability distributions: Limited to normal distribution only

Alternative Solutions

  • NumPy: High-performance arrays and mathematical functions
  • Pandas: Data manipulation with built-in statistical methods
  • SciPy: Advanced statistical functions and hypothesis testing
  • scikit-learn: Machine learning algorithms and advanced statistics
  • statsmodels: Econometric and statistical modeling
  • Custom implementations: For specific requirements or streaming data

Additional Learning Resources

Books and Publications

  • "Think Stats" by Allen B. Downey - Statistics and probability for programmers
  • "Statistics for Hackers" by Jake VanderPlas - Modern statistical analysis with Python
  • "Python for Data Analysis" by Wes McKinney - Comprehensive data analysis with pandas and NumPy
  • "Introduction to Statistical Learning with Python" - Statistical learning concepts with Python implementations

Online Tutorials and Courses

  • Real Python: Python statistics module tutorial - Comprehensive guide with examples
  • DataCamp: Python statistics courses - Interactive lessons with hands-on practice
  • Coursera: "Statistics with Python" specialization - University-level statistical analysis
  • Khan Academy: Statistics and probability - Foundation concepts with visual explanations

Practice and Examples

  • Kaggle Learn: Statistics course - Free micro-courses with real-world datasets
  • HackerRank: Statistics challenges - Programming problems involving statistical calculations
  • LeetCode: Math and statistics problems - Coding interview preparation
  • GitHub: Awesome Statistics - Curated list of statistical resources

Advanced Topics

  • SciPy Documentation: Statistical functions - Advanced statistical tests and distributions
  • NumPy Documentation: Mathematical functions - Array-based mathematical operations
  • Pandas Documentation: Computational tools - DataFrame statistical methods
  • Matplotlib/Seaborn: Data visualization for statistical analysis

Community Resources

  • r/statistics: Reddit community for statistical discussions and help
  • Cross Validated: Stack Exchange site for statistics questions
  • Python.org: SIG for Scientific Computing - Special interest group discussions
  • PyData: Global community for Python data science practitioners