I look into Ben Recht’s post questioning the existence of the “data-generating distribution” (DGD), derive an optimal predictor with and without it, and then go into how it’s the sampling mechanism that introduces stochasticity into ML (and not the DGD).


The two perspectives on populations, sampling, and randomness. Diagram by Nano Banana Pro.

I recently ran into this post by Ben Recht that argues that the “data-generating distribution” (DGD), a ubiquitous assumption in the machine learning literature, does not exist. Concretely, the DGD assumption states that a data point is simply a sample from a distribution $\mathcal{D}$. For example, we assume there exists a stochastic process that generated all the images of handwritten digits in the MNIST dataset. Under this assumption, the images in the dataset are simply samples drawn from this (unknown) distribution. However, Recht argues that 1) this distribution does not exist and 2) this distribution is not necessary to derive popular machine learning methods, e.g., batch gradient descent.

I’ll walk through Recht’s claim, then derive an optimal predictor with and without a DGD, and finally detail how randomness is introduced through our choice of sampling mechanism, and not from nature.

The DGD does not exist

Recht does not formally prove this claim, but he justifies it by arguing that even if the data generation process is messy and difficult to model, it is nevertheless deterministic. For example, Anthropic recently bought a bunch of old books, ripped them apart, scanned them, and used them as data to pretrain their LLM, Claude. Modeling how the authors wrote those books or how Anthropic selected them would be an incredibly complicated task, but that doesn’t mean it was a stochastic process.

Recht then explains how popular machine learning concepts like optimal predictors, generalization, and online learning can be built without the stochasticity of the DGD. He proposes focusing on three components: populations (the true set of all data points we could observe), metrics (how we measure performance on the population), and mechanisms (how samples are selected from the population).

Deriving the optimal predictor

First, let’s start by defining our terms.

  • A predictor is a function $f$ that takes in an input $x$ and returns a prediction $f(x)$ of the corresponding output $y$.
  • A loss is a function $\ell$ that maps a target $y$ and a prediction $f(x)$ to a scalar cost $\ell(y, f(x))$.
  • The population is the set of all possible input/output pairs that we care about (let $N$ be the size of this set).
  • A metric is how we choose to apply the loss to the population.
  • A mechanism is how we pick out samples from the population, in practice.

Perspective 1: With a DGD

In this traditional perspective, the population is defined by the support of the DGD, $\mathcal{D}$. The optimal predictor is the function that minimizes the loss over the unknown distribution $\mathcal{D}$. This is formally written as:

$$f^\star = \arg\min_{f} R(f),$$

where $R(f)$ is the expectation of the loss:

$$R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell(y, f(x))\right],$$

and if $\mathcal{D}$ is the empirical distribution defined by the dataset $\{(x_i, y_i)\}_{i=1}^{n}$, this expectation reduces to the familiar average $\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$ over the training data.
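
As a side note (a standard result, not something specific to Recht’s post): under the squared loss $\ell(y, \hat{y}) = (y - \hat{y})^2$, this optimal predictor has a closed form, namely the conditional mean:

$$f^\star(x) = \arg\min_{\hat{y}} \, \mathbb{E}\left[(y - \hat{y})^2 \mid x\right] = \mathbb{E}[y \mid x],$$

which follows from expanding the square: $\mathbb{E}[(y - \hat{y})^2 \mid x] = \mathrm{Var}(y \mid x) + (\mathbb{E}[y \mid x] - \hat{y})^2$, and the second term is minimized by setting $\hat{y} = \mathbb{E}[y \mid x]$.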

Perspective 2: Without a DGD

Here, we strip away the stochasticity. The population is no longer the support of an abstract distribution $\mathcal{D}$, but simply a concrete, finite set $\mathcal{P} = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$. This set represents all the data points we wish to act on, e.g., all the books ever written, or every single handwritten digit that has ever been scanned. While it may be impossible to actually collect the full set $\mathcal{P}$, the key point is that the set exists deterministically.

In this framework, we make an explicit design decision (i.e., our metric) on how to score the predictor over the population. If we choose our metric to be the average loss over the population, the population loss becomes:

$$R_{\mathcal{P}}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)).$$
The difference

The expressions for the loss in the two frameworks, $R(f)$ and $R_{\mathcal{P}}(f)$, look mathematically identical. However, the source of the “average” is fundamentally different. In the DGD framework, we minimize an expectation because we assume nature generates data stochastically. In the distribution-free framework, we minimize an average because we chose the average as our scoring metric for the population. Recht argues that this latter approach allows us to define optimality without assuming nature acts like a random number generator.
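
To drive the point home, here’s a toy sketch (my own illustration, with made-up numbers): on a small, fixed population, the expectation under the empirical distribution and the average-loss metric are literally the same computation; only the story about where the average comes from differs.

```python
# Toy illustration: the expected loss under the empirical distribution and
# the average loss over a finite population are the same computation.
# All names and numbers here are my own; a constant predictor and the
# squared loss are used just to keep the example tiny.

population = [(0.0, 1.0), (1.0, 2.0), (2.0, 2.5), (3.0, 4.0)]  # (x, y) pairs

def predictor(x):
    return 2.0  # a (deliberately simple) constant predictor

def loss(y, y_hat):
    return (y - y_hat) ** 2

N = len(population)

# Perspective 1: expectation under the empirical distribution,
# which places probability 1/N on each point.
expected_loss = sum((1.0 / N) * loss(y, predictor(x)) for x, y in population)

# Perspective 2: the average loss over the population, chosen as our metric.
average_loss = sum(loss(y, predictor(x)) for x, y in population) / N

print(expected_loss, average_loss)  # identical numbers, different stories
```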

Sampling Mechanisms

In practice, we do not have access to the whole population. Therefore, we must rely on sampling mechanisms to give us a subset of the data. This mechanism determines how we approximate the true population loss and, as Recht argues, is where the randomness in ML comes from (rather than it being a property of the world).

Recht then goes into three different mechanisms:

1. Batch Learning

In batch learning, the mechanism is a one-time “grab” of a subset $S \subseteq \mathcal{P}$ of size $n$. Then the population loss is approximated as:

$$R_{\mathcal{P}}(f) \approx \frac{1}{n} \sum_{(x, y) \in S} \ell(y, f(x)).$$
In the traditional view with a DGD, this is justified by saying that the data was generated i.i.d., and therefore the Law of Large Numbers guarantees that, with a large enough sample size, the empirical average will converge to the true expected loss.

In Recht’s view, the data is fixed. Randomness is injected by the sampling mechanism that we design and use. If we choose to sample uniformly from the population, the Law of Large Numbers will still hold. Note, however, that it holds because of how WE chose to sample from the population, and not because of some assumption about how the data was created.
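
Here is a small sketch of that idea (my own illustration, with made-up numbers): the population below is a fixed, deterministic list, the only random object is the uniform sampler we run over it, and the sample average of the loss still lands close to the population average.

```python
import random

# A fixed, deterministic "population" of (x, y) pairs -- no DGD anywhere.
# The numbers are arbitrary; only the sampling below is random.
population = [(float(i), 3.0 * i + (i % 7)) for i in range(100_000)]

def predictor(x):
    return 3.0 * x  # some fixed predictor we want to evaluate

def loss(y, y_hat):
    return (y - y_hat) ** 2

# The quantity we actually care about: the average loss over the population.
population_loss = sum(loss(y, predictor(x)) for x, y in population) / len(population)

# The sampling mechanism WE chose: uniform sampling with replacement.
rng = random.Random(0)
n = 2_000
sample = [rng.choice(population) for _ in range(n)]
sample_loss = sum(loss(y, predictor(x)) for x, y in sample) / n

print(f"population loss: {population_loss:.3f}")
print(f"sample estimate: {sample_loss:.3f}")  # close, because of how we sampled
```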

2. Online Learning

In online learning, we process the data sequentially. This is an interesting situation because it requires no randomness at all! In the traditional view, we assume the stream is i.i.d. (again, a strong assumption). In Recht’s view, we don’t need to make any assumptions about the data if, instead of approximating the average, we aim to minimize regret. Regret compares our cumulative loss against the best fixed predictor we could’ve chosen in hindsight:

$$\text{Regret}_T = \sum_{t=1}^{T} \ell\big(y_t, f_t(x_t)\big) - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell\big(y_t, f(x_t)\big),$$

where $f_t$ is the predictor we deployed at step $t$ and $\mathcal{F}$ is our model class.
If regret grows sublinearly with time, then we can guarantee that our learning algorithm is performing as well as the best possible model in our model class regardless of how the data was generated! Again, no need for stochasticity or a data-generating distribution!
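
Here’s a minimal sketch of this (again my own illustration, not code from the post): the stream below is a fixed, deterministic sequence, nothing i.i.d. about it, yet a simple online gradient step on the squared loss keeps the regret against the best constant prediction in hindsight small.

```python
# Minimal online-learning sketch. The stream is a fixed, deterministic,
# decidedly non-i.i.d. sequence; the only "assumption" is the update rule
# we chose. We measure regret against the best constant prediction.

T = 10_000
stream = [float((t * 37) % 5) for t in range(T)]  # deterministic stream in {0,...,4}

def loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

theta = 0.0            # our online prediction (a single constant parameter)
cumulative_loss = 0.0

for t, y in enumerate(stream, start=1):
    cumulative_loss += loss(y, theta)   # predict first, then observe y and update
    grad = theta - y                    # gradient of 0.5 * (y - theta)^2 w.r.t. theta
    theta -= grad / t                   # decaying 1/t step size

# Best constant prediction in hindsight: the mean of the stream.
best = sum(stream) / T
best_loss = sum(loss(y, best) for y in stream)

regret = cumulative_loss - best_loss
print(f"regret: {regret:.2f}, average regret: {regret / T:.5f}")
```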

3. Empiricist Learning

In this mechanism, you assume the population is fixed (no DGD), but you have the power to act on the population to select samples. And because you know how you’re selecting these samples, you know the probability of selection exactly. You don’t need to assume nature is i.i.d. if you made the data i.i.d. by the way you designed the experiment. As an example, consider estimating the average price of books in a library from a sample of 10 books.

  • In the DGD view, we are given 10 books and immediately assume they are i.i.d. samples from some underlying distribution.
  • In Recht’s view, we instead randomly select 10 books from the library’s registry of all books and then check their prices.

In the second case, randomness came only from our selection mechanism rather than from the data. And because we know exactly how we randomly selected the books, we can derive guarantees for how biased and/or confident learning will be BEFORE we even run an experiment.

Note that all of this assumes that we can execute our sampling mechanism of choice. If we are just given the books, with no notion of how likely each book was to be picked, we lose the capacity to make these ex ante guarantees.
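
Here’s a sketch of that books example (my own numbers and variable names): because we control the uniform selection mechanism and know the price range, a standard concentration bound (Hoeffding’s inequality) gives us a confidence interval on the library-wide average price before we look at a single book.

```python
import math
import random

# A fixed registry of book prices (made-up numbers). In Recht's framing this
# is the deterministic population; nothing about it is random.
registry = [5.0 + (i % 40) for i in range(5_000)]   # prices between 5 and 44

price_min, price_max = 5.0, 44.0   # known bounds on any price in the registry
n = 10                             # sample size we committed to in advance

# Ex ante guarantee (computable BEFORE sampling): for a uniform sample of
# size n, Hoeffding's inequality gives a 95% confidence half-width of
#   (max - min) * sqrt(ln(2 / 0.05) / (2 * n)).
half_width = (price_max - price_min) * math.sqrt(math.log(2 / 0.05) / (2 * n))
print(f"ex ante 95% half-width: ±{half_width:.2f}")

# Now run the mechanism WE designed: uniform sampling without replacement.
rng = random.Random(42)
sample = rng.sample(registry, n)
estimate = sum(sample) / n
true_average = sum(registry) / len(registry)

print(f"estimate: {estimate:.2f}, true average: {true_average:.2f}")
```

With only 10 books the interval is wide, but that’s beside the point: we could write it down before touching the data, purely from the mechanism we chose, which is exactly the kind of ex ante guarantee being described here.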

Conclusion

Recht’s argument offers an interesting reframing of ML foundations, even if the immediate algorithmic shifts aren’t yet obvious (to me, at least). I think I’ll probably dive deeper into the mathematical details by looking at his full course on ML (what a time to be alive when such resources are freely available!!), because all this sounds very relevant to my own research on online learning and experiment design applied to model-based RL.


Blogpost 18/100