I’m pretty on board with this research agenda, but I’m curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.
With respect to your question about fixed points, I think the issue is quite complicated, and I’d rather approach it indirectly by collecting criteria and trying to make models which fit the various criteria. But here are some attempted thoughts.
We should be quite skeptical of just taking a fixed point, without carefully building up all the elements of the final solution—we don’t just want consistency, we want consistency as a result of sufficiently humanlike deliberation. This is similar to the idea that naive infinite HCH might be malign (because it’s just some weird fixed point of humans-consulting-HCH), but if we ensure that the HCH tree is finite by (a) requiring all queries to have a recursion budget, or (b) having a probability of randomly stopping (not allowing the tree to be expanded any further), or things like that, we can avoid weird fixed points (and, not coincidentally, these models fit better with what you’d get from iterated amplification if you’re training it carefully rather than in a way which allows weird malign fixed-points to creep in).
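Purely as an illustration (my sketch, not anything from the original comment), here is a toy Python version of the two tree-bounding mechanisms mentioned above, (a) a recursion budget on every query and (b) random stopping. `human_answer` and `decompose` are hypothetical stand-ins for the human policy and for question decomposition:

```python
import random

def human_answer(question, subanswers):
    """Hypothetical stand-in for the human policy: combine subanswers into an answer."""
    return f"answer({question}; {subanswers})"

def decompose(question):
    """Hypothetical stand-in for the subquestions the human would ask (empty in this toy)."""
    return []

def hch_with_budget(question, budget):
    # (a) Every query carries a recursion budget, so the tree has bounded depth
    # and there is no room for an infinite tree pinned down only by consistency.
    if budget == 0:
        return human_answer(question, subanswers=[])
    subanswers = [hch_with_budget(q, budget - 1) for q in decompose(question)]
    return human_answer(question, subanswers)

def hch_with_random_stopping(question, stop_prob=0.2):
    # (b) Each expansion is cut off with some probability, so arbitrarily deep
    # subtrees become vanishingly unlikely rather than being fixed by a weird
    # self-consistent solution.
    if random.random() < stop_prob:
        return human_answer(question, subanswers=[])
    subanswers = [hch_with_random_stopping(q, stop_prob) for q in decompose(question)]
    return human_answer(question, subanswers)
```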
However, I still may want to take fixed points in the design; for example, the way UTAAs allow me to collapse all the meta-levels down. A big difference between your approach in the post and mine here is that I’ve got more separation between the rationality criteria of the design vs the rationality the system is going to learn, so I can use pure fixed points on one but not the other (hopefully that makes sense?). The system can be based on a perfect fixed point of some sort, while still building up a careful picture iteratively improving on initial models. That’s kind of illustrated by the recursive quantilization idea. The output is supposed to come from an actual fixed-point of quantilizing UTAAs, but it can also be seen as the result of successive layers. (Though overall I think we probably don’t get enough of the “carefully building up incremental improvements” spirit.)
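To make the "fixed point reached via successive layers" picture concrete, here is a rough sketch, not the post's actual formalism: `propose_utaa`, `initial_score`, and `derive_score` are hypothetical stand-ins for the base distribution over UTAAs, the starting value model, and the way a chosen UTAA supplies the next layer's scoring rule.

```python
import random

def quantilize(propose, score, q=0.1, n_samples=1000):
    """Standard quantilizer: sample candidates from the base distribution,
    rank them by score, and pick uniformly from the top q fraction."""
    candidates = [propose() for _ in range(n_samples)]
    candidates.sort(key=score, reverse=True)
    return random.choice(candidates[: max(1, int(q * n_samples))])

def layered_quantilization(propose_utaa, initial_score, derive_score,
                           n_layers=5, q=0.1):
    """Run finitely many quantilization layers. The infinite limit would be the
    fixed point of quantilizing UTAAs; any finite prefix is a sequence of
    incremental improvements on the initial model."""
    score = initial_score
    for _ in range(n_layers):
        chosen = quantilize(propose_utaa, score, q=q)
        # derive_score (hypothetical) extracts the scoring rule the chosen
        # UTAA endorses, which the next layer quantilizes against.
        score = derive_score(chosen)
    return score
```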
(If I were going to point out one thing it’s lacking, it’s that it’s a little hazy on whether you’re supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.)
Agreed, I was totally lazy about this. I might write something more detailed in the future, but this felt like an OK version to get the rough ideas out. After all, I think there are bigger issues than this (IE the two desiderata failures I pointed out at the end).
Ah, nice post, sorry I didn’t see it originally! It’s pointing at a very very related idea.
Seems like it also has to do with John’s communication model.