Why Statsmodels?

An essential Python library for statistical modeling, hypothesis testing, and rigorous data analysis across economics, the social sciences, and advanced research domains.

Statistical Modeling

Build robust regression and estimation models

Hypothesis Testing

Conduct rigorous statistical inference

Time Series

Analyze temporal patterns and forecasting

Data Exploration

Visualize and understand data relationships

Get Started in VSCode

Watch the setup tutorial and follow the installation steps to start building statistical models.

Complete walkthrough covering Python setup, VSCode configuration, and statsmodels installation.

1. Prerequisites

Python 3.8+, VSCode with Python extension, pip installed

2. Install Core Package

pip install statsmodels

3. Add Dependencies

pip install numpy pandas matplotlib scipy

4. Verify Setup

python -c "import statsmodels"

Used In

Economics, Social Sciences, Finance

Documentation

statsmodels.org

Python Version

3.8+

Complete documentation for regression and linear models from statsmodels

Linear Regression (OLS)

Ordinary Least Squares regression for fitting linear models to data with independently and identically distributed errors.

Mathematical Foundation
Y = Xβ + ε, where ε ~ N(0, Σ)

Parameters

endog
array_like (required)
1-d endogenous response variable (dependent variable).
exog
array_like (required)
nobs x k array of regressors. Intercept not included automatically.
missing
str (default: 'none')
How to handle NaN values: 'none', 'drop', or 'raise'.

Core Methods

fit() · predict() · f_test() · t_test() · wald_test() · get_influence() · outlier_test() · cov_params()

Code Example: Basic OLS

Python
import numpy as np
import statsmodels.api as sm

# Generate sample data
x = np.linspace(0, 10, 100)
X = sm.add_constant(np.column_stack((x, x**2)))
y = np.dot(X, [1, 0.1, 10]) + np.random.normal(size=100)

# Fit OLS model
results = sm.OLS(y, X).fit()
print(results.summary())

Model Properties

rsquared
R-squared measure of model fit
aic
Akaike Information Criterion
bic
Bayesian Information Criterion
fvalue
F-statistic of model
condition_number
Multicollinearity detection
resid
Model residuals
Important Note
Always use sm.add_constant() to add an intercept term; statsmodels does not include one automatically. Check for multicollinearity using condition_number (values above roughly 20-30 are a common warning sign).

Generalized Linear Models (GLM)

Flexible regression framework for exponential family distributions with customizable link functions.

Model Specification
μᵢ = E[Yᵢ|xᵢ] = g⁻¹(x'ᵢβ)
7 Distribution Families · 14 Link Functions · IRWLS Estimation

Binomial

For binary or proportional response data

Domain: {0, 1, ..., n}
v(μ) = μ - μ²/n

Poisson

For count data

Domain: {0, 1, ..., ∞}
v(μ) = μ

Gaussian

For continuous normal data

Domain: (-∞, ∞)
v(μ) = 1

Gamma

For positive continuous data

Domain: (0, ∞)
v(μ) = μ²

Inverse Gaussian

For positive skewed data

Domain: (0, ∞)
v(μ) = μ³

Negative Binomial

For overdispersed counts

v(μ) = μ + αμ²

Code Example: GLM with Gamma Family

Python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit Gamma GLM with a log link
model = smf.glm('y ~ x1 + x2', data=df,
                family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()
print(results.summary())

Diagnostic Plots

Residuals vs Fitted: Scatter plot showing residuals against fitted values to assess homoscedasticity.

Q-Q Plot: Quantile-quantile plot comparing residual distribution to theoretical normal distribution.

Generalized Estimating Equations (GEE)

Marginal regression for panel, cluster, or repeated measures data with correlated observations within clusters.

Marginal Model
E[Yᵢⱼ|Xᵢⱼ] = g⁻¹(X'ᵢⱼβ)

Covariance Structures

Independence
Uncorrelated observations within clusters.
Exchangeable
Constant correlation: Corr(Yᵢⱼ, Yᵢₖ) = ρ
Autoregressive
Exponential decay: Corr(Yᵢⱼ, Yᵢₖ) = ρ^|j-k|

Key Features

Robust Inference
Sandwich estimator provides valid inference even with misspecified correlation.
Panel Data
Designed for clustered and longitudinal data structures.

Code Example: GEE with Exchangeable Correlation

Python
import statsmodels.api as sm

fam = sm.families.Poisson()
ind = sm.cov_struct.Exchangeable()

model = sm.GEE.from_formula('y ~ x', groups='id', data=data,
                            cov_struct=ind, family=fam)
results = model.fit()
print(results.summary())

Robust Linear Models (RLM)

M-estimators for regression resistant to outliers using iteratively reweighted least squares.

Objective Function
min Σ ρ(eᵢ/s)

HuberT (Default)

Quadratic for small residuals, linear for large.

ρ(r) = r²/2 if |r| ≤ c, else c|r| - c²/2

RamsayE

Exponential downweighting of outliers.

ρ(r) = a⁻²(1 - exp(-a|r|)(1 + a|r|))

TukeyBiweight

Completely downweights beyond threshold.

ρ(r) = (c²/6)(1 - (1-(r/c)²)³) if |r| ≤ c, else c²/6

Code Example: RLM with Huber Norm

Python
import statsmodels.api as sm

# Fit robust model
rlm_model = sm.RLM(y, X, M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()

# Compare with OLS
ols_results = sm.OLS(y, X).fit()
print(rlm_results.summary())

Linear Mixed Effects Models (MixedLM)

Hierarchical models with both fixed and random effects for clustered or longitudinal data.

Mixed Model Framework
Y = Xβ + Zu + ε

Random Intercepts

Each group has its own baseline: Yᵢⱼ = (β₀ + uᵢ) + β₁Xᵢⱼ + εᵢⱼ

Random Slopes

Groups have different intercepts and slopes: Yᵢⱼ = (β₀ + u₀ᵢ) + (β₁ + u₁ᵢ)Xᵢⱼ + εᵢⱼ

Code Example: Random Intercepts

Python
import statsmodels.formula.api as smf

md = smf.mixedlm("y ~ x", data, groups=data["group_id"])
mdf = md.fit()
print(mdf.summary())
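The random-slopes specification described above adds a per-group slope via the re_formula argument. A self-contained sketch with simulated grouped data (group counts and variance scales are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated longitudinal data: 20 groups, 10 observations each
rng = np.random.default_rng(3)
groups = np.repeat(np.arange(20), 10)
x = rng.uniform(0, 1, size=200)
u0 = rng.normal(scale=1.0, size=20)[groups]   # random intercepts
u1 = rng.normal(scale=0.5, size=20)[groups]   # random slopes
y = 2.0 + u0 + (1.0 + u1) * x + rng.normal(scale=0.3, size=200)
data = pd.DataFrame({"y": y, "x": x, "group_id": groups})

# re_formula="~x" adds a random slope for x alongside the random intercept
md = smf.mixedlm("y ~ x", data, groups=data["group_id"], re_formula="~x")
mdf = md.fit()
print(mdf.summary())
```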

Discrete Dependent Variable Models

Maximum likelihood estimation for binary, multinomial, count, and ordinal outcomes.

Binary Models

Logit
P(y=1|x) = 1/(1+exp(-x'β))
Uses logistic CDF. Log-odds interpretation.
Probit
P(y=1|x) = Φ(x'β)
Uses normal CDF. Lighter tails.

Count Models

Poisson
For count data with mean equals variance.
Negative Binomial
For overdispersed counts (variance > mean).
Zero-Inflated
Mixture model for excess zeros.

Code Example: Logit with Marginal Effects

Python
import statsmodels.api as sm

logit_model = sm.Logit(y, X)
results = logit_model.fit()

# Calculate marginal effects
margeff = results.get_margeff(at='mean')
print(margeff.summary())

Analysis of Variance (ANOVA)

Statistical tests for differences among group means and variance partitioning.

ANOVA Framework
SST = SSM + SSE
One-Way & Multi-Way · Repeated Measures · Type I, II, III SS

Code Example: Two-Way ANOVA

Python
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('y ~ C(factor1) * C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Quantile Regression

Estimate conditional quantiles of the response variable for distributional analysis.

Quantile Function
Qτ(Y|X) = X'βτ
Any Quantile τ ∈ (0,1) · Robust to Outliers

Code Example: Multiple Quantiles

Python
import statsmodels.formula.api as smf

quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
model = smf.quantreg('y ~ x', data)  # build once, fit per quantile
for q in quantiles:
    results = model.fit(q=q)
    print(f"τ={q}: {results.params['x']:.4f}")

Quantile Regression Visualization

Scatter plot with multiple regression lines at different quantiles (τ=0.1, 0.25, 0.5, 0.75, 0.9) showing how relationships vary across the distribution. Median line (τ=0.5) in bold.
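A figure like the one described can be reproduced with matplotlib. This sketch uses synthetic heteroscedastic data (noise growing with x), so the quantile lines fan out rather than running parallel; all names are illustrative:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic heteroscedastic data: noise scale grows with x
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x)
data = pd.DataFrame({"x": x, "y": y})

model = smf.quantreg("y ~ x", data)
xs = np.linspace(0, 10, 50)

plt.scatter(x, y, alpha=0.3, s=10)
for q in [0.1, 0.25, 0.5, 0.75, 0.9]:
    res = model.fit(q=q)
    lw = 2.5 if q == 0.5 else 1  # median line in bold
    plt.plot(xs, res.params["Intercept"] + res.params["x"] * xs,
             lw=lw, label=f"τ={q}")
plt.legend()
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```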

Recursive Least Squares

State space model for time-varying parameters using Kalman filter framework.

State Space Framework
βₜ = βₜ₋₁ + ηₜ
Time-Varying Parameters · Kalman Filter · CUSUM Tests

Code Example: Time-Varying Coefficients

Python
import statsmodels.api as sm

model = sm.RecursiveLS(y, X)
results = model.fit()

# Plot recursive coefficients
results.plot_recursive_coefficient(0, alpha=0.05)
results.plot_recursive_coefficient(1, alpha=0.05)

Recursive Parameter Evolution

Intercept: Time series plot showing estimated intercept over time with 95% confidence bands.

Slope: Evolution of slope coefficient with widening confidence bands during uncertainty periods. CUSUM test included for stability.