I look into Ben Recht’s post questioning the existence of the “data-generating distribution” (DGD), derive an optimal predictor with and without it, and then explain how it’s the sampling mechanism that introduces stochasticity into ML (and not the DGD).
I recently ran into this post by Ben Recht that argues that the “data-generating distribution” (DGD), a ubiquitous assumption in machine learning literature, does not exist.
Concretely, the DGD assumption states that a data point $(x, y)$ is drawn from some underlying probability distribution $\mathcal{D}$, i.e., $(x, y) \sim \mathcal{D}$.
I’ll walk through Recht’s claim, then derive an optimal predictor with and without a DGD, and finally detail how randomness is introduced through our choice of sampling mechanism, and not from nature.
The DGD does not exist
Recht does not formally prove this claim, but he justifies it by arguing that even if the data generation process is messy and difficult to model, it is nevertheless deterministic. For example, Anthropic recently bought a bunch of old books, ripped them apart, scanned them, and used them as data to pretrain their LLM, Claude. Modeling how the authors wrote those books or how Anthropic selected them would be an incredibly complicated task, but that doesn’t mean it was a stochastic process.
Recht then explains how popular machine learning concepts like optimal predictors, generalization, and online learning can be built without the stochasticity of the DGD. He proposes focusing on three components: populations (the true set of all data points we could observe), metrics (how we measure performance on the population), and mechanisms (how samples are selected from the population).
Deriving the optimal predictor
Let’s start by defining our terms.
- A predictor is a function $f : \mathcal{X} \to \mathcal{Y}$ that takes in an input $x$ and returns a prediction $\hat{y} = f(x)$ of the corresponding output.
- A loss $\ell(y, \hat{y})$ is a function that maps a target and a prediction to a scalar cost.
- The population is the set $\Omega$ of all possible input/output pairs $(x, y)$ that we care about (let $N$ be the size of this set).
- A metric is how we choose to apply the loss to the population.
- A mechanism is how we pick out samples from the population, in practice.
Perspective 1: With a DGD
In this traditional perspective, the population is defined by the support of the DGD, and the optimal predictor is the one that minimizes the expected loss:

$$f^\star = \arg\min_f \; \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(y, f(x))\big],$$

where the expectation is taken over draws $(x, y) \sim \mathcal{D}$, and if we use the squared loss $\ell(y, \hat{y}) = (y - \hat{y})^2$, the minimizer is the conditional mean $f^\star(x) = \mathbb{E}[y \mid x]$.
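As a quick sanity check on that last claim (this derivation isn’t from Recht’s post; it’s the textbook argument), fix an input $x$ and any candidate prediction $c$. The bias-variance decomposition gives

$$\mathbb{E}\big[(y - c)^2 \mid x\big] = \mathbb{E}\big[(y - \mathbb{E}[y \mid x])^2 \mid x\big] + \big(\mathbb{E}[y \mid x] - c\big)^2,$$

and since only the second term depends on $c$, the expected squared error is minimized exactly at $c = \mathbb{E}[y \mid x]$.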
Perspective 2: Without a DGD
Here, we strip away the stochasticity.
The population is no longer the support of an abstract distribution $\mathcal{D}$; it is simply a fixed, finite set $\Omega = \{(x_1, y_1), \dots, (x_N, y_N)\}$.
In this framework, we make an explicit design decision (i.e., our metric) on how to score the predictor over the population. If we choose our metric to be the average loss over the population, the population loss becomes:

$$L(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(y_i, f(x_i)\big).$$
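To make this concrete, here is a minimal sketch in code. Everything in it (the toy population, the candidate predictor, the squared loss) is made up for illustration; the point is just that the population is a plain finite list and the metric is an average we chose to compute.

```python
import numpy as np

# A toy, fully enumerated population: N fixed (x, y) pairs.
# (All names and numbers here are made up for illustration.)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=1_000)
Y = 2.0 * X + 0.1 * np.sin(20.0 * X)  # a fixed, deterministic relationship

def predictor(x):
    """A candidate predictor f(x)."""
    return 2.0 * x

def loss(y, y_hat):
    """Squared-error loss ell(y, y_hat)."""
    return (y - y_hat) ** 2

# The metric is OUR design choice: the plain average of the loss
# over the entire (finite, fixed) population.
population_loss = np.mean(loss(Y, predictor(X)))
print(f"population loss L(f) = {population_loss:.6f}")
```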
The difference
The expressions for the population loss look nearly identical in the two perspectives: if the DGD is uniform over a finite population, the expectation and the average coincide exactly. The difference is in the interpretation: in the first view, the weighting over data points is handed to us by nature’s distribution; in the second, it is a metric we explicitly chose.
Sampling Mechanisms
In practice, we do not have access to the whole population.
Therefore, we must rely on sampling mechanisms to give us a subset of the data.
This mechanism determines how we approximate the true population loss $L(f)$.
Recht then goes into three different mechanisms:
1. Batch Learning
In batch learning, the mechanism is a one-time “grab” of a subset of $n$ samples from the population, and we evaluate predictors with the empirical average loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$ over that subset.
In the traditional view with a DGD, this is justified by saying that the data was generated i.i.d., and therefore the Law of Large Numbers guarantees that, with a large enough sample size, the empirical average will converge to the true expected loss.
In Recht’s view, the data is fixed. Randomness is injected by the sampling mechanism that we design and use. If we choose to sample uniformly from the population, the Law of Large Numbers still holds. Note, however, that it holds because of how WE chose to sample from the population, not because of some assumption about how the data was created.
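A minimal sketch of this point (the numbers are toy values, not from Recht’s post): the population below is just a fixed array, and the convergence of the empirical average comes entirely from the uniform sampling mechanism we run.

```python
import numpy as np

# A fixed, deterministic population of per-example losses. Any fixed
# array works; once created, there is no distribution "behind" it.
rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=100_000)
true_average = population.mean()

# WE inject the randomness: uniform sampling with replacement.
for n in (10, 100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=True)
    print(f"n = {n:>6}: empirical average = {sample.mean():.4f} "
          f"(population average = {true_average:.4f})")
```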
2. Online Learning
In online learning, we process the data sequentially. This is an interesting situation because it requires no randomness at all! In the traditional view we assume the stream is i.i.d. (again, a strong assumption). In Recht’s view, we don’t need to make any assumptions about the data if, instead of approximating the average, we aim to minimize regret. Regret compares the cumulative loss against the best possible predictor we could’ve chosen in hindsight:

$$R_T = \sum_{t=1}^{T} \ell\big(y_t, f_t(x_t)\big) - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell\big(y_t, f(x_t)\big).$$
If regret grows sublinearly with time (i.e., $R_T / T \to 0$), then we can guarantee that our learning algorithm performs, on average, as well as the best possible model in our model class $\mathcal{F}$.
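Here is a small sketch of that idea (my own toy setup, not from the post): the stream is completely deterministic, an online gradient descent learner processes it one point at a time, and we measure regret against the best fixed linear predictor in hindsight.

```python
import numpy as np

# A deterministic data stream: no i.i.d. assumption, no randomness at all.
T = 1_000
xs = np.sin(0.1 * np.arange(T))  # arbitrary fixed inputs
ys = 3.0 * xs + 0.5              # arbitrary fixed targets

# Online gradient descent over 1-D linear predictors f(x) = w * x.
w = 0.0
cumulative_loss = 0.0
for t in range(T):
    pred = w * xs[t]
    cumulative_loss += (ys[t] - pred) ** 2
    grad = 2.0 * (pred - ys[t]) * xs[t]  # d/dw of the squared loss
    w -= grad / np.sqrt(t + 1.0)         # decaying step size

# Best fixed predictor in hindsight (closed form for squared loss).
w_star = np.dot(xs, ys) / np.dot(xs, xs)
best_loss = np.sum((ys - w_star * xs) ** 2)

regret = cumulative_loss - best_loss
print(f"R_T = {regret:.3f}, R_T / T = {regret / T:.5f}")
```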
3. Empiricist Learning
In this mechanism, you assume the population is fixed (no DGD), but you have the power to act on the population to select samples. And because you know how you’re selecting these samples, you know the probability of selection exactly. You don’t need to assume nature is i.i.d. if you made the data i.i.d. by the way you designed the experiment. An example: consider estimating the average price of books in a library from a sample of 10 books.
- In the DGD view, we are given 10 books and immediately assume they are i.i.d. samples from the DGD.
- In Recht’s view, we instead randomly select 10 books from the registry of all books and then check their prices.

In the second case, randomness came only from our selection mechanism rather than the data. And because we know exactly how we randomly selected the books, we can derive guarantees on the bias and confidence of our estimate BEFORE we even run the experiment.
Note that all of this assumes that we can execute our sampling mechanism of choice. If we are just given the books, with no notion of how easy it was to pick out each book, we lose the capacity to make these ex ante guarantees.
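A minimal sketch of the book example (all prices are made up): because WE run the uniform selection, and the prices lie in a known range, a Hoeffding bound on the estimation error is available before a single book is sampled.

```python
import numpy as np

# A fixed registry of book prices: the population is just a list.
# (Prices are made up; the key fact is they lie in a known range.)
rng = np.random.default_rng(42)
prices = rng.uniform(5.0, 50.0, size=5_000)
true_mean = prices.mean()

# OUR mechanism: uniform sampling without replacement, n = 10 books.
n = 10
sample = rng.choice(prices, size=n, replace=False)
estimate = sample.mean()

# Ex ante guarantee via Hoeffding's inequality: since every price lies
# in [5, 50], BEFORE sampling we already know
#   P(|estimate - true_mean| >= eps) <= 2 * exp(-2 n eps^2 / (50 - 5)^2).
eps = 10.0
bound = 2.0 * np.exp(-2.0 * n * eps**2 / (50.0 - 5.0) ** 2)
print(f"true mean = {true_mean:.2f}, estimate = {estimate:.2f}")
print(f"P(error >= {eps:.0f}) <= {bound:.3f}  (known before sampling)")
```

The guarantee is a property of the mechanism, not the data: if the 10 books were handed to us by someone else, no such bound would be available.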
Conclusion
Recht’s argument offers an interesting reframing of ML foundations, even if the immediate algorithmic implications aren’t yet obvious (to me, at least). I’ll probably dive deeper into the mathematical details by looking at his full course on ML (what a time to be alive when such resources are freely available!!), because all of this sounds very relevant to my own research on online learning and experiment design applied to model-based RL.
Blogpost 18/100