Spent the afternoon reading “Martingale Posterior Neural Networks for Fast Sequential Decision Making” (Duran-Martin et al., 2025) which focuses on algorithms for online learning of neural net parameters used for Bayesian decision making. My understanding in a few sentences: they use a low-rank Kalman filter to obtain a distribution over neural net parameters and model predictions. Then they use the predictive distribution for sequential decision making by sampling from it (instead of the parameter posterior, as in classical Thompson sampling). They call this “predictive sampling” and they use the Martingale posterior framework as a way of bridging the gap between their frequentist parameter update and the Bayesian-like goal of decision making under uncertainty. Finally, they show their method on a variety of online learning examples such as online MNIST, a recommender algorithm, and Bayesian optimization.
Here’s my summary of some of the most important ideas:
Problem statement
Consider an environment that returns a reward (or observation) $y_t$ based on an input $x_t$. Given a dataset $\mathcal{D}_{1:t} = \{(x_i, y_i)\}_{i=1}^{t}$, where the relationship between inputs and rewards is modelled by a neural network $f_\theta$ with parameters $\theta$, the objective of the paper is to obtain a predictive distribution over the next observation, $p(y_{t+1} \mid x_{t+1}, \mathcal{D}_{1:t})$, in a form that can be updated online and used for decision making.
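To spell that out in symbols (my notation, not necessarily the paper's): a fully Bayesian treatment would obtain this predictive by integrating out the parameters,

$$
p(y_{t+1} \mid x_{t+1}, \mathcal{D}_{1:t}) = \int p(y_{t+1} \mid x_{t+1}, \theta)\, p(\theta \mid \mathcal{D}_{1:t})\, \mathrm{d}\theta .
$$

The paper's angle, as I read it, is to get something like the left-hand side cheaply and online, with a Gaussian filter standing in for the parameter posterior rather than an explicit prior-times-likelihood computation.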
Online learning with a Kalman filter
The authors use an extended Kalman filter (EKF) to recursively compute the statistics of a Gaussian distribution over the parameters, $\mathcal{N}(\theta \mid \mu_t, \Sigma_t)$, updating the mean and covariance after every new observation.
A vanilla EKF does not scale to large networks: the memory cost scales quadratically with the parameter dimension (due to the square covariance matrix) and the compute cost scales cubically (due to the filter’s matrix inversions). Therefore, they propose a variety of low-rank approximations for the covariance matrix, which make their approach feasible for modern neural networks.
For example, their algorithm HiLoFi uses a low-rank approximation for the covariance of the hidden layers and a full-rank covariance for the last layer. Another example is LRKF, which uses a low-rank approximation for the covariance of all the parameters.
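To make the filtering part concrete, here is a minimal sketch of the kind of EKF recursion the paper builds on. This is emphatically not HiLoFi or LRKF: it keeps a full covariance over all parameters (exactly the thing that does not scale), uses a toy one-hidden-layer network with a finite-difference Jacobian, and all names (`mlp`, `ekf_step`, etc.) are mine, chosen for illustration.

```python
# Sketch only: full-covariance EKF on a toy network, not the paper's HiLoFi/LRKF.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8

def mlp(theta, x):
    """Tiny one-hidden-layer network mapping a scalar input to a scalar prediction."""
    w1 = theta[:HIDDEN]
    b1 = theta[HIDDEN:2 * HIDDEN]
    w2 = theta[2 * HIDDEN:3 * HIDDEN]
    b2 = theta[3 * HIDDEN]
    h = np.tanh(w1 * x + b1)
    return float(w2 @ h + b2)

def jacobian(f, theta, eps=1e-6):
    """Finite-difference Jacobian of a scalar-valued f w.r.t. theta, shape (1, d)."""
    base = f(theta)
    J = np.zeros((1, theta.size))
    for i in range(theta.size):
        pert = theta.copy()
        pert[i] += eps
        J[0, i] = (f(pert) - base) / eps
    return J

def ekf_step(mu, Sigma, x, y, obs_var=0.1, q=1e-4):
    """One EKF update of the Gaussian N(mu, Sigma) over the network parameters."""
    # Predict: random-walk dynamics on the parameters (slightly inflate the covariance).
    Sigma_pred = Sigma + q * np.eye(mu.size)
    # Linearize the network around the current mean.
    H = jacobian(lambda th: mlp(th, x), mu)          # (1, d)
    # Innovation (prediction error) and its variance.
    resid = y - mlp(mu, x)
    s = float(H @ Sigma_pred @ H.T) + obs_var
    # Kalman gain, then the mean/covariance update.
    K = Sigma_pred @ H.T / s                         # (d, 1)
    mu_new = mu + K[:, 0] * resid
    Sigma_new = Sigma_pred - K @ H @ Sigma_pred
    return mu_new, Sigma_new

# Online loop over a stream of (x, y) pairs from a toy environment.
d = 3 * HIDDEN + 1
mu, Sigma = 0.1 * rng.standard_normal(d), np.eye(d)
for _ in range(200):
    x = rng.uniform(-2.0, 2.0)
    y = np.sin(x) + 0.1 * rng.standard_normal()
    mu, Sigma = ekf_step(mu, Sigma, x, y)
```

The low-rank variants replace the dense `Sigma` above with a cheaper factorization; the shape of the recursion (linearize, compute the innovation, update mean and covariance) stays the same.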
I won’t go into the gory details of the math, but the important parts are that:
- The Kalman filter approach provides statistics to define a parameter distribution, which is then propagated into a predictive distribution by linearizing the neural network and matching moments (sketched in code after this list).
- Their low-rank approximations allow them to use large networks while maintaining tractability.
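Continuing the snippet above (and again using my notation, reusing `mlp`, `jacobian`, `mu`, `Sigma` from that sketch), the moment-matched predictive at a new input is just a Gaussian whose variance comes from pushing the parameter covariance through the linearization:

```python
def predictive(mu, Sigma, x, obs_var=0.1):
    """Moment-matched Gaussian predictive p(y | x, data) under the linearized network.

    The predictive mean is f(x; mu); the variance pushes the parameter covariance
    through the linearization, H Sigma H^T, plus the observation noise.
    """
    H = jacobian(lambda th: mlp(th, x), mu)   # (1, d) linearization at the mean
    mean = mlp(mu, x)
    var = float(H @ Sigma @ H.T) + obs_var
    return mean, var
```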
Decision making and martingales
The authors then introduce a novel approach for sequential decision making that avoids sampling from the high-dimensional parameter posterior.
This is justified with a “posterior-first” perspective which, as I understand it (and I don’t understand it very well yet), uses martingales to argue that working only with the posterior predictive (which is much lower-dimensional than the parameter distribution) is sufficient for decision making. This is the part of the paper I’m least sure about, so I apologize for being vague here.
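Here is my mental model of “predictive sampling”, continuing the sketch above; to be clear, this is my reading of the idea, not a line-by-line reproduction of their algorithm, and `select_action` is a name I made up. Instead of drawing a full parameter vector $\theta \sim \mathcal{N}(\mu, \Sigma)$ as in Thompson sampling, you draw one scalar sample from each candidate action’s predictive distribution and act greedily on those samples.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_action(mu, Sigma, candidate_actions, obs_var=0.1):
    """Pick an action by sampling from each candidate's moment-matched predictive.

    Classic Thompson sampling would instead draw a full parameter vector
    theta ~ N(mu, Sigma) and maximize the network's output under that theta;
    here we only ever sample scalars from the per-action predictive Gaussians.
    """
    sampled_rewards = []
    for a in candidate_actions:
        mean, var = predictive(mu, Sigma, a, obs_var)  # predictive() from the sketch above
        sampled_rewards.append(rng.normal(mean, np.sqrt(var)))
    return candidate_actions[int(np.argmax(sampled_rewards))]

# e.g. choose among a grid of scalar "actions" fed to the reward model
best = select_action(mu, Sigma, np.linspace(-2.0, 2.0, 21))
```

The appeal, as I understand it, is that the randomness lives in a space whose dimension is the number of candidate actions rather than the number of network parameters.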
Final thoughts
Overall, a very cool paper. I’m very interested in online learning and I’ve implemented online learning with EKFs and neural networks, so this is very relevant to me. I’m still unsure how the martingales justify using the posterior predictive for decision making. The authors also cite work on mis-specified BNN priors and likelihoods, which I want to look into. I also wonder whether this framework could be extended to multi-step predictions for model-based control.
This is blogpost 16/100