Why Statsmodels?

An essential Python library for statistical modeling, hypothesis testing, and rigorous data analysis across economics, the social sciences, and advanced research domains.

Statistical Modeling

Build robust regression and estimation models

Hypothesis Testing

Conduct rigorous statistical inference

Time Series

Analyze temporal patterns and forecasting

Data Exploration

Visualize and understand data relationships

Get Started in VSCode

Watch the setup tutorial and follow the installation steps to start building statistical models.

Complete walkthrough covering Python setup, VSCode configuration, and statsmodels installation.

1. Prerequisites

Python 3.8+, VSCode with Python extension, pip installed

2. Install Core Package

pip install statsmodels

3. Add Dependencies

pip install numpy pandas matplotlib scipy

4. Verify Setup

python -c "import statsmodels"

Used In

Economics, Social Sciences, Finance

Documentation

statsmodels.org

Python Version

3.8+

Complete documentation for regression and linear models from statsmodels

Linear Regression (OLS)

Ordinary Least Squares regression for fitting linear models to data with independently and identically distributed errors.

Mathematical Foundation
Y = Xβ + ε, where ε ~ N(0, Σ)

Parameters

endog
array_like (required)
1-d endogenous response variable (dependent variable).
exog
array_like (required)
nobs x k array of regressors. Intercept not included automatically.
missing
str (default: 'none')
How to handle NaN values: 'none', 'drop', or 'raise'.

Core Methods

fit() · predict() · f_test() · t_test() · wald_test() · get_influence() · outlier_test() · cov_params()

Code Example: Basic OLS

Python
import numpy as np
import statsmodels.api as sm

# Generate sample data
x = np.linspace(0, 10, 100)
X = sm.add_constant(np.column_stack((x, x**2)))
y = np.dot(X, [1, 0.1, 10]) + np.random.normal(size=100)

# Fit OLS model
results = sm.OLS(y, X).fit()
print(results.summary())

Model Properties

rsquared
R-squared measure of model fit
aic
Akaike Information Criterion
bic
Bayesian Information Criterion
fvalue
F-statistic of model
condition_number
Multicollinearity detection
resid
Model residuals
Important Note
Always use sm.add_constant() to add an intercept term; statsmodels does not include one automatically. Check for multicollinearity using condition_number (values above roughly 20-30 are a common warning sign).

Generalized Linear Models (GLM)

Flexible regression framework for exponential family distributions with customizable link functions.

Model Specification
μᵢ = E[Yᵢ|xᵢ] = g⁻¹(x'ᵢβ)
7 Distribution Families · 14 Link Functions · IRWLS Estimation

Binomial

For binary or proportional response data

Domain: {0, 1, ..., n}
v(μ) = μ - μ²/n

Poisson

For count data

Domain: {0, 1, ..., ∞}
v(μ) = μ

Gaussian

For continuous normal data

Domain: (-∞, ∞)
v(μ) = 1

Gamma

For positive continuous data

Domain: (0, ∞)
v(μ) = μ²

Inverse Gaussian

For positive skewed data

Domain: (0, ∞)
v(μ) = μ³

Negative Binomial

For overdispersed counts

v(μ) = μ + αμ²

Code Example: GLM with Gamma Family

Python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit Gamma GLM with a log link
model = smf.glm('y ~ x1 + x2', data=df,
                family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()
print(results.summary())

Diagnostic Plots

Residuals vs Fitted: Scatter plot showing residuals against fitted values to assess homoscedasticity.

Q-Q Plot: Quantile-quantile plot comparing residual distribution to theoretical normal distribution.

Generalized Estimating Equations (GEE)

Marginal regression for panel, cluster, or repeated measures data with correlated observations within clusters.

Marginal Model
E[Yᵢⱼ|Xᵢⱼ] = g⁻¹(X'ᵢⱼβ)

Covariance Structures

Independence
Uncorrelated observations within clusters.
Exchangeable
Constant correlation: Corr(Yᵢⱼ, Yᵢₖ) = ρ
Autoregressive
Exponential decay: Corr(Yᵢⱼ, Yᵢₖ) = ρ^|j-k|

Key Features

Robust Inference
Sandwich estimator provides valid inference even with misspecified correlation.
Panel Data
Designed for clustered and longitudinal data structures.

Code Example: GEE with Exchangeable Correlation

Python
import statsmodels.api as sm

fam = sm.families.Poisson()
ind = sm.cov_struct.Exchangeable()

model = sm.GEE.from_formula('y ~ x', groups='id', data=data,
                            cov_struct=ind, family=fam)
results = model.fit()
print(results.summary())

Robust Linear Models (RLM)

M-estimators for regression resistant to outliers using iteratively reweighted least squares.

Objective Function
min Σ ρ(eᵢ/s)

HuberT (Default)

Quadratic for small residuals, linear for large.

ρ(r) = r²/2 if |r| ≤ c, else c|r| - c²/2

RamsayE

Exponential downweighting of outliers.

ρ(r) = a⁻²(1 - exp(-a|r|)(1 + a|r|))

TukeyBiweight

Completely downweights beyond threshold.

ρ(r) = (c²/6)(1 - (1-(r/c)²)³) if |r| ≤ c, else c²/6

Code Example: RLM with Huber Norm

Python
import statsmodels.api as sm

# Fit robust model
rlm_model = sm.RLM(y, X, M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()

# Compare with OLS
ols_results = sm.OLS(y, X).fit()
print(rlm_results.summary())

Linear Mixed Effects Models (MixedLM)

Hierarchical models with both fixed and random effects for clustered or longitudinal data.

Mixed Model Framework
Y = Xβ + Zu + ε

Random Intercepts

Each group has its own baseline: Yᵢⱼ = (β₀ + uᵢ) + β₁Xᵢⱼ + εᵢⱼ

Random Slopes

Groups have different intercepts and slopes: Yᵢⱼ = (β₀ + u₀ᵢ) + (β₁ + u₁ᵢ)Xᵢⱼ + εᵢⱼ

Code Example: Random Intercepts

Python
import statsmodels.formula.api as smf

md = smf.mixedlm("y ~ x", data, groups=data["group_id"])
mdf = md.fit()
print(mdf.summary())
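The random-slopes specification described above adds a per-group slope via the re_formula argument. A self-contained sketch with simulated grouped data (group counts and variance scales are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated longitudinal data: 20 groups, 10 observations each
rng = np.random.default_rng(3)
groups = np.repeat(np.arange(20), 10)
x = rng.uniform(0, 1, size=200)
u0 = rng.normal(scale=1.0, size=20)[groups]   # random intercepts
u1 = rng.normal(scale=0.5, size=20)[groups]   # random slopes
y = 2.0 + u0 + (1.0 + u1) * x + rng.normal(scale=0.3, size=200)
data = pd.DataFrame({"y": y, "x": x, "group_id": groups})

# re_formula="~x" adds a random slope for x alongside the random intercept
md = smf.mixedlm("y ~ x", data, groups=data["group_id"], re_formula="~x")
mdf = md.fit()
print(mdf.summary())
```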

Discrete Dependent Variable Models

Maximum likelihood estimation for binary, multinomial, count, and ordinal outcomes.

Binary Models

Logit
P(y=1|x) = 1/(1+exp(-x'β))
Uses logistic CDF. Log-odds interpretation.
Probit
P(y=1|x) = Φ(x'β)
Uses normal CDF. Lighter tails.

Count Models

Poisson
For count data with mean equals variance.
Negative Binomial
For overdispersed counts (variance > mean).
Zero-Inflated
Mixture model for excess zeros.

Code Example: Logit with Marginal Effects

Python
import statsmodels.api as sm

logit_model = sm.Logit(y, X)
results = logit_model.fit()

# Calculate marginal effects
margeff = results.get_margeff(at='mean')
print(margeff.summary())

Analysis of Variance (ANOVA)

Statistical tests for differences among group means and variance partitioning.

ANOVA Framework
SST = SSM + SSE
One-Way & Multi-Way · Repeated Measures · Type I, II, III SS

Code Example: Two-Way ANOVA

Python
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('y ~ C(factor1) * C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Quantile Regression

Estimate conditional quantiles of the response variable for distributional analysis.

Quantile Function
Qτ(Y|X) = X'βτ
Any Quantile τ ∈ (0,1) · Robust to Outliers

Code Example: Multiple Quantiles

Python
import statsmodels.formula.api as smf

quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
model = smf.quantreg('y ~ x', data)  # build once, fit per quantile
for q in quantiles:
    results = model.fit(q=q)
    print(f"τ={q}: {results.params['x']:.4f}")

Quantile Regression Visualization

Scatter plot with multiple regression lines at different quantiles (τ=0.1, 0.25, 0.5, 0.75, 0.9) showing how relationships vary across the distribution. Median line (τ=0.5) in bold.
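A figure like the one described can be reproduced with matplotlib. This sketch uses synthetic heteroscedastic data (noise growing with x), so the quantile lines fan out rather than running parallel; all names are illustrative:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic heteroscedastic data: noise scale grows with x
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x)
data = pd.DataFrame({"x": x, "y": y})

model = smf.quantreg("y ~ x", data)
xs = np.linspace(0, 10, 50)

plt.scatter(x, y, alpha=0.3, s=10)
for q in [0.1, 0.25, 0.5, 0.75, 0.9]:
    res = model.fit(q=q)
    lw = 2.5 if q == 0.5 else 1  # median line in bold
    plt.plot(xs, res.params["Intercept"] + res.params["x"] * xs,
             lw=lw, label=f"τ={q}")
plt.legend()
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```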

Recursive Least Squares

State space model for time-varying parameters using Kalman filter framework.

State Space Framework
βₜ = βₜ₋₁ + ηₜ
Time-Varying Parameters · Kalman Filter · CUSUM Tests

Code Example: Time-Varying Coefficients

Python
import statsmodels.api as sm

model = sm.RecursiveLS(y, X)
results = model.fit()

# Plot recursive coefficients
results.plot_recursive_coefficient(0, alpha=0.05)
results.plot_recursive_coefficient(1, alpha=0.05)

Recursive Parameter Evolution

Intercept: Time series plot showing estimated intercept over time with 95% confidence bands.

Slope: Evolution of slope coefficient with widening confidence bands during uncertainty periods. CUSUM test included for stability.