Thesis: one of the biggest alignment obstacles is that we often think of the utility function as being basically-local, e.g. that each region has a goodness score and we’re summing the goodness over all the regions. This basically-guarantees that there is an optimal pattern for a local region, and thus that the global optimum is just a tiling of that local optimal pattern.
Even if one adds a preference for variation, this likely just means that a distribution of patterns is optimal, and the global optimum will be a tiling of samples from said distribution.
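To make the tiling claim concrete, here is a minimal sketch of the purely additive case. The pattern names and scores in `LOCAL_SCORE` are made up for illustration, and the "world" is assumed to be nothing more than a short tuple of independent regions. A brute-force search over every possible world then just repeats the best local pattern.

```python
from itertools import product

# Hypothetical local patterns and their goodness scores (illustrative only).
LOCAL_SCORE = {"A": 1.0, "B": 3.0, "C": 2.0}

def utility(world):
    """Basically-local utility: the sum of each region's goodness score."""
    return sum(LOCAL_SCORE[pattern] for pattern in world)

def global_optimum(n_regions):
    """Brute-force search over every assignment of patterns to regions."""
    return max(product(LOCAL_SCORE, repeat=n_regions), key=utility)

best_local = max(LOCAL_SCORE, key=LOCAL_SCORE.get)
best_world = global_optimum(4)
print(best_local, best_world)
```

Because the sum has no term that looks at more than one region, each region can be optimized on its own, so the best world is the same pattern in every slot. Any non-tiled optimum would need a term that couples regions.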
The trouble is that, to get started, it seems we would need to narrow down the class of functions to ones with some structure we can use to make sense of these things. But what general yet still nontrivial structure could we want?