I’m pretty on board with this research agenda, but I’m curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.
And on the assumption that you have no idea what I’m referring to, here’s the link to my post.
There are a couple different directions to go from here. One way is to try to collapse the recursion: find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and that also, hopefully, stays close to real humans by some metric), and use that model’s preferences. This is what I see as the endgame of the imitation / bootstrapping research.
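To gesture at what “collapsing the recursion” might look like mechanically, here’s a minimal sketch, assuming we already had a one-step model-ratification operator and a distance metric in hand; `ratify`, `distance`, and `HumanModel` are hypothetical stand-ins, not anything specified here.

```python
from dataclasses import dataclass
from typing import Callable

# Minimal sketch of "collapse the recursion": iterate a one-step
# model-ratification operator until we (approximately) hit a fixed point,
# while checking the result stays close to the original human model.
# All names here are hypothetical stand-ins.

@dataclass
class HumanModel:
    params: tuple  # whatever actually represents an agent-shaped model


def find_fixed_point(
    initial: HumanModel,
    ratify: Callable[[HumanModel], HumanModel],           # one ratification step
    distance: Callable[[HumanModel, HumanModel], float],  # "some metric"
    tol: float = 1e-6,
    max_drift: float = 1.0,
    max_iters: int = 1000,
) -> HumanModel:
    current = initial
    for _ in range(max_iters):
        ratified = ratify(current)
        if distance(ratified, initial) > max_drift:
            raise ValueError("ratified model drifted too far from real humans")
        if distance(ratified, current) < tol:
            return ratified  # (approximate) fixed point of the ratification process
        current = ratified
    return current  # best approximation within the iteration budget
```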
Another way might be to imitate communication, and find a way to use recursive models such that we can stop the recursion early without much loss in effectiveness. In communication, the innermost layer of the model can be quite simplistic, and then the next is more complicated by virtue of taking advantage of the first, and so on. At each layer you can abstract away some of the details of the previous layers, so by the time you’re at layer 4 maybe it doesn’t matter that layer 1 was just a crude facsimile of human behavior.
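A toy sketch of that “stop the recursion early” picture, assuming each layer is just a function built on top of the layer below it; all names are illustrative, not a proposed mechanism.

```python
from typing import Callable

# Toy sketch of the layered picture: layer 0 is a crude model of behavior,
# and each later layer builds on (an abstraction of) the layer below it.
# The recursion is simply cut off at a fixed depth.

Model = Callable[[str], str]  # maps a situation to a predicted response


def crude_base_model(situation: str) -> str:
    return "default response"  # the "crude facsimile of human behavior"


def refine(lower_layer: Model) -> Model:
    # Build a richer model by taking advantage of the layer below it.
    def richer(situation: str) -> str:
        guess = lower_layer(situation)
        return f"response that takes the lower-level guess {guess!r} into account"
    return richer


def build_layers(depth: int) -> Model:
    model: Model = crude_base_model
    for _ in range(depth):  # stop the recursion early, at a fixed depth
        model = refine(model)
    return model
```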
Thinking specifically about this UTAA monad thing, I think it’s a really clever way to think about what levers we have access to in the fixed-point picture. (If I was going to point out one thing it’s lacking, it’s that it’s a little hazy on whether you’re supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.) But it retains the things I’m worried about from this fixed-point picture, which is basically that I’m not sure it buys us much of anything if the starting point isn’t benign in a quite strong sense.
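To state the two readings side by side, here is a purely hypothetical type-level sketch; nothing below is the actual UTAA formalism, it’s just the ambiguity written as two signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Reading 1: meta-values live in V itself, which judges the whole
# recursive tree of UTAAs.
# Reading 2: V stays object-level, and a separate Q is trained on some
# outside source of meta-preference data.

@dataclass
class UTAANode:
    children: List["UTAANode"] = field(default_factory=list)


VOverTree = Callable[[UTAANode], float]   # Reading 1: V scores tree states
VOverOutcomes = Callable[[str], float]    # Reading 2: V scores ordinary outcomes


def fit_q_from_outside_data(meta_pref_data: Dict[str, float]) -> Callable[[str], float]:
    # Reading 2's extra ingredient: Q learns meta-preferences from external data.
    return lambda query: meta_pref_data.get(query, 0.0)  # stand-in for real learning
```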
I’m pretty on board with this research agenda, but I’m curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.
Ah, nice post, sorry I didn’t see it originally! It’s pointing at a very very related idea. Seems like it also has to do with John’s communication model.
With respect to your question about fixed points, I think the issue is quite complicated, and I’d rather approach it indirectly by collecting criteria and trying to make models which fit the various criteria. But here are some attempted thoughts.
We should be quite skeptical of just taking a fixed point without carefully building up all the elements of the final solution—we don’t just want consistency, we want consistency as a result of sufficiently humanlike deliberation. This is similar to the idea that naive infinite HCH might be malign (because it’s just some weird fixed point of humans-consulting-HCH). But if we ensure that the HCH tree is finite, by (a) requiring all queries to have a recursion budget, or (b) having a probability of randomly stopping (not allowing the tree to be expanded any further), or things like that, we can avoid weird fixed points. (Not coincidentally, these finite models also fit better with what you’d get from iterated amplification if you’re training it carefully, rather than in a way which allows weird malign fixed points to creep in.)
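A toy sketch of those two finiteness tricks, with `human_answer` and `decompose` as hypothetical stand-ins for the human policy and the question decomposition:

```python
import random
from typing import Callable, List

# Toy sketch of the two finiteness tricks: (a) every query carries a
# recursion budget, and (b) each node has some probability of randomly
# stopping, so the tree cannot be expanded any further at that point.

def finite_hch(
    question: str,
    human_answer: Callable[[str, List[str]], str],  # human answers, given sub-answers
    decompose: Callable[[str], List[str]],          # how a question splits into subquestions
    budget: int,                                    # (a) recursion budget
    stop_prob: float,                               # (b) random-stopping probability
) -> str:
    if budget <= 0 or random.random() < stop_prob:
        return human_answer(question, [])  # must answer unaided; no further expansion
    sub_answers = [
        finite_hch(sub_q, human_answer, decompose, budget - 1, stop_prob)
        for sub_q in decompose(question)
    ]
    return human_answer(question, sub_answers)
```

Either knob on its own already makes the tree finite (with probability 1, in the stop_prob case), which is what rules out the weird infinite fixed points.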
However, I still may want to take fixed points in the design; for example, the way UTAAs allow me to collapse all the meta-levels down. A big difference between your approach in the post and mine here is that I’ve got more separation between the rationality criteria of the design and the rationality the system is going to learn, so I can use pure fixed points on one but not the other (hopefully that makes sense?). The system can be based on a perfect fixed point of some sort, while still building up a careful picture by iteratively improving on initial models. That’s kind of illustrated by the recursive quantilization idea: the output is supposed to come from an actual fixed point of quantilizing UTAAs, but it can also be seen as the result of successive layers. (Though overall I think we probably don’t get enough of the “carefully building up incremental improvements” spirit.)
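For reference, a minimal sketch of a single quantilization step in the standard “top-q of a base distribution” sense; the recursive, UTAA-producing part of the proposal is deliberately left out, and the names are illustrative.

```python
import random
from typing import Callable, List, TypeVar

A = TypeVar("A")

# Minimal sketch of one quantilization step: draw proposals from a base
# distribution, keep the top q fraction according to the utility function,
# and pick uniformly from that slice.

def quantilize(
    sample_base: Callable[[], A],   # base distribution over proposals
    utility: Callable[[A], float],
    q: float = 0.1,
    n_samples: int = 1000,
) -> A:
    proposals = [sample_base() for _ in range(n_samples)]
    proposals.sort(key=utility, reverse=True)
    top_slice = proposals[: max(1, int(q * n_samples))]
    return random.choice(top_slice)
```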
(If I was going to point out one thing it’s lacking, it’s that it’s a little hazy on whether you’re supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.)
Agreed, I was totally lazy about this. I might write something more detailed in the future, but this felt like an OK version to get the rough ideas out. After all, I think there are bigger issues than this (i.e., the two desiderata failures I pointed out at the end).