I believe Chris has now updated the Towards Data Science blog post to be clearer, based on the conversation you had, but I’ll make some clarifications here as well, for the benefit of others:
The key claim, that Pβ(f∣S) ≈ Popt(f∣S), is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granularise the space of possible functions if we want to compare these two quantities empirically. If we take “the space of functions” to be all the functions that a given neural network can express on the entire vector space in which it is defined, then there would be uncountably many such functions, and any given function would never show up more than once in any kind of experiment we could do. We therefore need a way to lump the functions together into sensible buckets, and we decided to do that by looking at what output the function gives on a set of images not used in training. Stated differently, we look at the partial function that the network expresses on a particular subset of the input vector space, consisting of a set of points sampled from the underlying data distribution. So, basically, your optimistic guess is correct—the test data is only used as a way to lump functions together into a finite number of sensible buckets (and the test labels are not used).
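To make the bucketing concrete, here is a minimal sketch of how one might coarse-grain trained networks into partial functions by their predictions on a fixed held-out input set. This is an illustration under my own assumptions, not the paper’s actual code: `function_bucket` and `empirical_function_distribution` are hypothetical names, and a Keras-style `model.predict` is assumed. Note that only the held-out inputs are used, never their labels.

```python
from collections import Counter
import numpy as np

def function_bucket(model, test_inputs):
    """Coarse-grain a trained model into a 'function bucket': the tuple of
    labels it assigns to a fixed set of held-out inputs (labels unused)."""
    preds = np.argmax(model.predict(test_inputs), axis=-1)  # assumes a Keras-style API
    return tuple(int(p) for p in preds)  # hashable key for the partial function

def empirical_function_distribution(models, test_inputs):
    """Estimate P(f | S), restricted to these buckets, by tallying how often
    each partial function shows up across many independent training runs."""
    counts = Counter(function_bucket(m, test_inputs) for m in models)
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}
```

Running this over many networks trained by the optimiser, and again over samples from the prior conditioned on the training data, would give the two empirical distributions being compared.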