16  [Model Fitting and Generalized Linear Model GLM Families and Link Functions Figure 10.3: Generalized Linear Models extend linear regression to different response types. Each distribution family has an appropriate link function that connects the linear predictor to the mean of the distribution. GLMs unify regression for continuous, binary, count, and positive continuous data.
Generalized Linear Models]{.chapter-title}

NoteLearning Objectives

By the end of this chapter, you will be able to:

  • Understand the philosophy and workflow of model fitting in neuroscience
  • Implement linear regression and generalized linear models (GLMs) from scratch
  • Apply GLMs to neural data, particularly spike train analysis
  • Evaluate model performance using cross-validation and information criteria
  • Combat overfitting

The Bias-Variance Tradeoff Figure 10.1: The fundamental bias-variance tradeoff in statistical modeling. Underfitting (high bias) fails to capture patterns. Overfitting (high variance) captures noise. The goal is to find the sweet spot that generalizes well to new data.

overfitting using regularization techniques (L1, L2) - Connect statistical models to neural encoding and decoding problems

16.1 10.1 The Philosophy of Model Fitting

Figure 16.1: Statistical mapping reveals brain activity patterns, with GLM analysis showing regions responding to experimental conditions.

At its core, model fitting is the process of adjusting a mathematical model’s parameters so that it can best describe a set of data. It’s a fundamental practice in both neuroscience and AI, allowing us to formalize our hypotheses about how a system works and to test those hypotheses against real-world observations. This chapter builds on the data science foundations introduced in Chapter 8.

The general workflow of model fitting involves several key steps:

  1. Choosing a Model: This is the most critical step. The model is a mathematical representation of your hypothesis about the data-generating process. It could be a simple linear equation, a complex dynamical system, or a deep neural network. The choice of model should be guided by your scientific question and your prior knowledge of the system.

  2. Choosing a Cost Function: The cost function (also known as a loss function or objective function) is a measure of how well the model fits the data. It quantifies the discrepancy between the model’s predictions and the actual data. The goal of the fitting process is to find the model parameters that minimize this cost function.

  3. Fitting the Model: This is the optimization step, where we use an algorithm to find the parameters that minimize the cost function. For simple models, this can sometimes be done analytically, but for more complex models, we typically rely on numerical optimization methods.

  4. Evaluating the Model: Once the model is fit, we need to evaluate its performance. This involves not only looking at how well it fits the training data but also, crucially, how well it generalizes to new, unseen data. This is where techniques like cross-validation become essential.

The Bias-Variance Tradeoff

A fundamental concept in model fitting is the bias-variance tradeoff:

  • Bias: Error from overly simplistic assumptions. High bias → underfitting
  • Variance: Error from sensitivity to fluctuations in training data. High variance → overfitting
  • Goal: Find the sweet spot that minimizes total error

16.2 10.2 Linear Regression: A Simple Starting Point

The simplest and most common type of model is the linear regression model. In this model, we assume that the output variable is a linear function of the input variables, plus some noise.

Mathematical Formulation:

\[ y = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2) \]

Where: - \(y\) is the response variable (e.g., neural firing rate) - \(\mathbf{X}\) is the design matrix (input features)

The design matrix encodes experimental conditions and timing, forming the foundation of statistical analysis. - \(\boldsymbol{\beta}\) are the coefficients we want to estimate - \(\boldsymbol{\epsilon}\) is Gaussian noise

Code Lab: Linear Regression from Scratch

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

def linear_regression_from_scratch():
    """
    Implement linear regression using the closed-form solution.

    Application: Model a neuron's firing rate as a function of stimulus intensity.
    """
    np.random.seed(42)

    # Generate synthetic neural data
    # Scenario: Neuron responds linearly to stimulus with some noise
    n_trials = 200
    stimulus_intensity = np.linspace(0, 10, n_trials)

    # True relationship: firing_rate = 2 + 3 * stimulus + noise
    true_baseline = 2.0
    true_gain = 3.0
    noise = np.random.randn(n_trials) * 1.5

    firing_rate = true_baseline + true_gain * stimulus_intensity + noise

    # Prepare design matrix (add intercept)
    X = np.column_stack([np.ones(n_trials), stimulus_intensity])
    y = firing_rate

    # Closed-form solution: β = (X^T X)^{-1} X^T y
    beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

    # Predictions
    y_pred = X @ beta_hat

    # Calculate R²
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - (ss_res / ss_tot)

    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(stimulus_intensity, firing_rate, alpha=0.5,
               label='Observed data', color='#cc0000')
    ax.plot(stimulus_intensity, y_pred, 'b-', linewidth=2,
            label=f'Fitted model: y = {beta_hat[0]:.2f} + {beta_hat[1]:.2f}x')
    ax.plot(stimulus_intensity, true_baseline + true_gain * stimulus_intensity,
            'g--', linewidth=2, label=f'True model: y = {true_baseline} + {true_gain}x')

    ax.set_xlabel('Stimulus Intensity')
    ax.set_ylabel('Firing Rate (Hz)')
    ax.set_title(f'Linear Regression: Neural Encoding Model (R² = {r_squared:.3f})')
    ax.legend()
    ax.grid(alpha=0.3)

    plt.tight_layout()

    print(f"True parameters: baseline={true_baseline}, gain={true_gain}")
    print(f"Estimated parameters: baseline={beta_hat[0]:.2f}, gain={beta_hat[1]:.2f}")
    print(f"R² = {r_squared:.3f}")

    return beta_hat, r_squared

# linear_regression_from_scratch()

While simple, linear regression is a powerful tool, and it forms the foundation for many more complex models. It’s also a great way to introduce the core concepts of model fitting, such as the design matrix, the cost function (in this case, the mean squared error), and the process of finding the optimal parameters.

Figure 16.2: The hemodynamic response function shapes how neural activity appears in fMRI, rising and falling over seconds after stimulation.

16.3 10.3 Generalized Linear Models (GLMs)

Linear regression is a great tool, but it has its limitations. For one, it assumes that the noise in the data is Gaussian. This might be a reasonable assumption for some types of data, like fMRI BOLD signals, but it’s not a good assumption for other types of data, like spike counts, which are always non-negative integers.

This is where Generalized Linear Models (GLMs) come in. GLMs are a flexible framework that extends linear regression to handle a wider range of data types and noise distributions.

The GLM Framework

The GLM framework consists of three components:

  1. A linear predictor: \(\eta = \mathbf{X}\boldsymbol{\beta}\)
  2. A link function: \(g(\mu) = \eta\), which relates the mean response to the linear predictor
  3. A noise model (distribution family): e.g., Gaussian, Poisson, Bernoulli

Common GLM Specifications:

Application Noise Distribution Link Function Example
Continuous data Gaussian Identity: \(\mu = \eta\) fMRI BOLD
Count data (spikes) Poisson Log: \(\log(\mu) = \eta\) Neural spikes
Binary outcomes Bernoulli Logit: \(\log(\frac{\mu}{1-\mu}) = \eta\) Choice data

Code Lab: Poisson GLM for Spike Trains

The most common application of GLMs in neuroscience is modeling spike counts using a Poisson distribution.

def poisson_glm_spike_trains():
    """
    Fit a Poisson GLM to model neural spike counts.

    Scenario: Model how a neuron's spike count depends on stimulus features.
    """
    np.random.seed(42)

    # Generate synthetic spike data
    n_trials = 300

    # Two stimulus features: orientation (0-180°) and contrast (0-1)
    orientation = np.random.rand(n_trials) * 180
    contrast = np.random.rand(n_trials)

    # True relationship (log-linear)
    # log(λ) = β0 + β1 * orientation + β2 * contrast
    beta_true = np.array([0.5, 0.01, 2.0])

    X = np.column_stack([np.ones(n_trials), orientation, contrast])
    log_rate = X @ beta_true
    rate = np.exp(log_rate)  # Inverse link (exponential)

    # Generate Poisson spike counts
    spike_counts = np.random.poisson(rate)

    # Fit Poisson GLM using iteratively reweighted least squares (IRLS)
    # For simplicity, we'll use scikit-learn
    from sklearn.linear_model import PoissonRegressor

    model = PoissonRegressor(alpha=0, max_iter=1000)
    X_features = X[:, 1:]  # Exclude intercept (model adds it automatically)
    model.fit(X_features, spike_counts)

    # Extract parameters
    beta_hat = np.array([model.intercept_, *model.coef_])

    # Predictions
    rate_pred = np.exp(X @ beta_hat)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Observed vs Predicted spike counts
    axes[0].scatter(rate, spike_counts, alpha=0.4, c='#cc0000', label='Observed')
    max_rate = max(rate.max(), rate_pred.max())
    axes[0].plot([0, max_rate], [0, max_rate], 'k--', label='Perfect prediction')
    axes[0].set_xlabel('True Rate (λ)')
    axes[0].set_ylabel('Observed Spike Count')
    axes[0].set_title('Poisson GLM: Observed vs True Rate')
    axes[0].legend()
    axes[0].grid(alpha=0.3)

    # Plot 2: Effect of contrast on firing rate
    contrast_range = np.linspace(0, 1, 100)
    orientation_fixed = 90  # Fix orientation at 90°

    X_pred = np.column_stack([
        np.ones(100),
        np.full(100, orientation_fixed),
        contrast_range
    ])
    rate_pred_contrast = np.exp(X_pred @ beta_hat)

    axes[1].plot(contrast_range, rate_pred_contrast, 'b-', linewidth=2,
                 label='Predicted')
    axes[1].set_xlabel('Stimulus Contrast')
    axes[1].set_ylabel('Predicted Firing Rate (Hz)')
    axes[1].set_title(f'Effect of Contrast (orientation={orientation_fixed}°)')
    axes[1].grid(alpha=0.3)
    axes[1].legend()

    plt.tight_layout()

    print("True parameters:", beta_true)
    print("Estimated parameters:", beta_hat)
    print(f"Parameter recovery error: {np.linalg.norm(beta_true - beta_hat):.3f}")

    return beta_hat

# poisson_glm_spike_trains()

Neuroscience Insight: Poisson GLMs are the workhorse model for analyzing spike trains. They allow us to: - Model how neurons encode stimulus features (encoding model) - Decode stimulus information from neural activity (decoding) - Quantify tuning curves (e.g., orientation tuning in V1)

Maximum Likelihood Estimation

GLMs are typically fit using maximum likelihood estimation (MLE). The idea is to find parameters that maximize the probability of observing the data.

Log-likelihood for Poisson GLM:

\[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(\lambda_i) - \lambda_i - \log(y_i!) \right] \]

where \(\lambda_i = \exp(\mathbf{x}_i^T \boldsymbol{\beta})\) is the predicted rate.

16.4 10.4 Overfitting and Regularization

A major challenge in model fitting is overfitting. This occurs when a model learns the training data too well, capturing not only the underlying signal but also the random noise. An overfit model will perform very well on the data it was trained on, but it will fail to generalize to new, unseen data.

Cross-Validation

The best way to detect overfitting is through cross-validation. Instead of training the model on all of the available data, we split the data into a training set and a testing set. We fit the model on the training set and then evaluate its performance on the testing set. If the model performs much better on the training set than on the testing set, it is likely overfit.

def demonstrate_cross_validation():
    """
    Demonstrate k-fold cross-validation for model evaluation.
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures

    np.random.seed(42)

    # Generate data
    n = 100
    X = np.linspace(0, 10, n).reshape(-1, 1)
    y = 2 + 3 * X.ravel() + 0.5 * X.ravel()**2 + np.random.randn(n) * 5

    # Try different polynomial degrees
    degrees = [1, 2, 3, 5, 10, 15]
    train_scores = []
    cv_scores = []

    for degree in degrees:
        # Create polynomial features
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_poly = poly.fit_transform(X)

        # Fit model
        model = Ridge(alpha=0.1)
        model.fit(X_poly, y)

        # Training score
        train_score = model.score(X_poly, y)
        train_scores.append(train_score)

        # 5-fold cross-validation score
        cv_score = cross_val_score(model, X_poly, y, cv=5,
                                   scoring='r2').mean()
        cv_scores.append(cv_score)

    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(degrees, train_scores, 'o-', label='Training R²',
            color='#0066cc', linewidth=2)
    ax.plot(degrees, cv_scores, 's-', label='Cross-Validation R²',
            color='#cc0000', linewidth=2)
    ax.axhline(y=max(cv_scores), color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Polynomial Degree')
    ax.set_ylabel('R² Score')
    ax.set_title('Overfitting Detection with Cross-Validation')
    ax.legend()
    ax.grid(alpha=0.3)

    optimal_degree = degrees[np.argmax(cv_scores)]
    print(f"Optimal polynomial degree: {optimal_degree}")
    print(f"Best CV R²: {max(cv_scores):.3f}")

    plt.tight_layout()
    return degrees, train_scores, cv_scores

# demonstrate_cross_validation()

Regularization: L1 and L2

One of the most effective ways to combat overfitting is regularization. Regularization is a technique that adds a penalty term to the cost function, discouraging the model from learning overly complex or extreme parameter values.

L2 Regularization (Ridge Regression):

\[ \text{Cost} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \]

This method adds a penalty proportional to the square of the magnitude of the model’s weights. It encourages the model to use all of the input features but to keep the weights small.

L1 Regularization (Lasso Regression):

\[ \text{Cost} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]

This method adds a penalty proportional to the absolute value of the weights. This has the effect of forcing some of the weights to be exactly zero, effectively performing a kind of automatic feature selection.

def compare_regularization():
    """
    Compare L1 (Lasso) and L2 (Ridge) regularization.
    """
    from sklearn.linear_model import Ridge, Lasso

    np.random.seed(42)

    # Generate data with many irrelevant features
    n_samples = 100
    n_features = 50
    n_informative = 5  # Only 5 features are actually relevant

    X = np.random.randn(n_samples, n_features)

    # True coefficients (only first 5 are non-zero)
    beta_true = np.zeros(n_features)
    beta_true[:n_informative] = [3, -2, 4, -1, 2.5]

    y = X @ beta_true + np.random.randn(n_samples) * 0.5

    # Fit models
    ridge = Ridge(alpha=1.0)
    lasso = Lasso(alpha=0.1, max_iter=10000)

    ridge.fit(X, y)
    lasso.fit(X, y)

    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # True coefficients
    axes[0].bar(range(n_features), beta_true, color='gray', alpha=0.7)
    axes[0].set_title('True Coefficients - (5 non-zero)')
    axes[0].set_xlabel('Feature Index')
    axes[0].set_ylabel('Coefficient Value')

    # Ridge coefficients
    axes[1].bar(range(n_features), ridge.coef_, color='#0066cc', alpha=0.7)
    axes[1].set_title('Ridge (L2) Regularization - (All features shrunk)')
    axes[1].set_xlabel('Feature Index')
    axes[1].set_ylabel('Coefficient Value')

    # Lasso coefficients
    axes[2].bar(range(n_features), lasso.coef_, color='#cc0000', alpha=0.7)
    axes[2].set_title(f'Lasso (L1) Regularization - ({np.sum(lasso.coef_ != 0)} non-zero features)')
    axes[2].set_xlabel('Feature Index')
    axes[2].set_ylabel('Coefficient Value')

    for ax in axes:
        ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
        ax.grid(alpha=0.3, axis='y')

    plt.tight_layout()

    print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
    print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")
    print(f"True model: {n_informative} non-zero coefficients")

    return ridge.coef_, lasso.coef_

# compare_regularization()

Key Differences: - Ridge (L2): Shrinks all coefficients, but none to exactly zero. Good when all features contribute somewhat. - Lasso (L1): Drives some coefficients to exactly zero. Good for feature selection when many features are irrelevant.

Bayesian Perspective on Regularization

From a Bayesian perspective, regularization can be seen as imposing a prior on the model’s parameters:

  • L2 regularization ≡ Gaussian prior: \(\beta_j \sim \mathcal{N}(0, 1/\lambda)\)
  • L1 regularization ≡ Laplace prior: \(p(\beta_j) \propto \exp(-\lambda|\beta_j|)\)

This provides an elegant connection between frequentist regularization and Bayesian inference.

16.5 10.5 Model Selection: AIC

Model Selection with AIC/BIC Figure 10.2: Model selection using information criteria. Training error always decreases with complexity, but test error has a minimum. AIC and BIC balance model fit with complexity penalties, with BIC penalizing complexity more heavily. The optimal model generalizes best to new data.

AIC and BIC

How do we choose between models of different complexity? Information criteria provide a principled way to balance fit quality with model complexity.

Akaike Information Criterion (AIC):

\[ \text{AIC} = 2k - 2\ln(\hat{L}) \]

Bayesian Information Criterion (BIC):

\[ \text{BIC} = k\ln(n) - 2\ln(\hat{L}) \]

Where: - \(k\) = number of parameters - \(n\) = number of observations - \(\hat{L}\) = maximum likelihood

Lower is better. BIC penalizes complexity more heavily than AIC.

def model_selection_demo():
    """
    Demonstrate model selection using AIC and BIC.
    """
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    np.random.seed(42)

    # Generate data (true model is quadratic)
    n = 80
    X = np.linspace(0, 10, n).reshape(-1, 1)
    y = 2 + 3*X.ravel() - 0.5*X.ravel()**2 + np.random.randn(n) * 5

    # Try different polynomial degrees
    degrees = range(1, 8)
    aic_scores = []
    bic_scores = []

    for degree in degrees:
        # Fit model
        poly = PolynomialFeatures(degree=degree, include_bias=True)
        X_poly = poly.fit_transform(X)
        model = LinearRegression()
        model.fit(X_poly, y)

        # Calculate residuals
        y_pred = model.predict(X_poly)
        residuals = y - y_pred
        rss = np.sum(residuals**2)

        # Log-likelihood (assuming Gaussian noise)
        n = len(y)
        sigma_sq = rss / n
        log_likelihood = -n/2 * np.log(2 * np.pi * sigma_sq) - rss / (2 * sigma_sq)

        # Number of parameters (including intercept)
        k = degree + 1

        # AIC and BIC
        aic = 2*k - 2*log_likelihood
        bic = k*np.log(n) - 2*log_likelihood

        aic_scores.append(aic)
        bic_scores.append(bic)

    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(list(degrees), aic_scores, 'o-', label='AIC',
            color='#0066cc', linewidth=2, markersize=8)
    ax.plot(list(degrees), bic_scores, 's-', label='BIC',
            color='#cc0000', linewidth=2, markersize=8)

    best_aic = degrees[np.argmin(aic_scores)]
    best_bic = degrees[np.argmin(bic_scores)]

    ax.axvline(x=best_aic, color='#0066cc', linestyle='--', alpha=0.5,
               label=f'Best AIC (degree={best_aic})')
    ax.axvline(x=best_bic, color='#cc0000', linestyle='--', alpha=0.5,
               label=f'Best BIC (degree={best_bic})')

    ax.set_xlabel('Polynomial Degree')
    ax.set_ylabel('Information Criterion (lower is better)')
    ax.set_title('Model Selection: AIC vs BIC')
    ax.legend()
    ax.grid(alpha=0.3)

    plt.tight_layout()

    print(f"Best model by AIC: degree {best_aic}")
    print(f"Best model by BIC: degree {best_bic}")
    print(f"True model: quadratic (degree 2)")

    return aic_scores, bic_scores

# model_selection_demo()

16.6 Exercises

Conceptual Questions

  1. Linear Regression Assumptions: What are the key assumptions of linear regression? For which types of neuroscience data would these assumptions be violated?

  2. GLM vs Linear Regression: Explain why a Poisson GLM is more appropriate than linear regression for modeling spike counts. What would go wrong if you used linear regression?

  3. Regularization Intuition: In your own words, explain how L1 and L2 regularization prevent overfitting. When would you prefer one over the other?

Computational Problems

  1. Implement GLM from Scratch: Implement a Poisson GLM using gradient descent (not using sklearn). Test it on synthetic spike train data.

  2. Cross-Validation: Generate synthetic data where the true model is a 3rd-degree polynomial. Use k-fold cross-validation to determine the optimal polynomial degree. Plot train vs validation error.

  3. Feature Selection: Generate data with 20 features, where only 3 are truly informative. Use Lasso regression to perform automatic feature selection. Compare your results to using all features with Ridge regression.

  4. Neural Encoding Model: Load a real neural dataset (e.g., from CRCNS.org) and fit a GLM to predict spike counts from stimulus features. Evaluate your model using cross-validation.

Discussion Questions

  1. Model Complexity: Discuss the trade-off between model interpretability and predictive power. When might you prefer a simple linear model over a more complex nonlinear model in neuroscience?

  2. Bayesian vs Frequentist: Compare the frequentist approach (point estimates, confidence intervals) to the Bayesian approach (posterior distributions) for model fitting. What are the advantages of each?

  3. Real-World Application: Design a GLM-based experiment to test whether a neuron in motor cortex encodes movement direction, speed, or both. What would your design matrix look like? How would you test your hypothesis?

16.7 References

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC.

Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models (4th ed.). CRC Press.

Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2), 1074-1089.

Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15(4), 243-262.

Pillow, J. W., Shlens, J., Paninski, L., et al. (2008). Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207), 995-999.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Gelman, A., Carlin, J. B., Stern, H. S., et al. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.