I agree that trying to align an AGI entirely by behavior (as in RL) is unlikely to generalize adequately after extensive learning (which will be necessary) and in dramatically new contexts (which seem inevitable).
There are alternatives which have not been analyzed or discussed much yet:
Goals selected from learned knowledge: an alternative to RL alignment
I think this class of approach might be the “fundamental advance” you’re calling for. These approaches haven’t gotten enough attention to be in the common consciousness, so I doubt the authors were considering them in their “default path”.
I think these approaches are all fairly obvious if you're building the relevant types of AGI, which people are currently working on. So the default path to AGI might well include these approaches, which don't define goals through behavior. That might well keep the default path from ending in failure. I'm currently optimistic, but not at all sure until GSLK approaches get more analysis.
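To make the contrast with behavioral training concrete, here's a minimal toy sketch of the GSLK idea as I understand it: the goal is selected as a pointer into concepts the system has already learned, and plans are evaluated against that concept, rather than the goal being implicitly shaped by rewards. Everything here (the WorldModel class, the concept labels, the scoring) is illustrative, not anyone's actual proposal or architecture.

```python
# Toy sketch: a goal as a pointer into learned concepts, not a reward signal.
# All names and feature sets are made up; a real system would have a learned
# world model with far richer (and harder-to-interpret) internal concepts.
from dataclasses import dataclass

@dataclass
class WorldModel:
    """Stands in for a learned model whose internal concepts we can inspect."""
    concepts: dict[str, set[str]]  # concept name -> features the model associates with it

    def concept_score(self, concept: str, predicted_state: set[str]) -> float:
        """How strongly a predicted world state matches the named concept."""
        features = self.concepts[concept]
        return len(features & predicted_state) / len(features)

# The alignment step: select a goal from already-learned concepts,
# rather than shaping behavior with rewards and hoping the intended goal emerges.
wm = WorldModel(concepts={
    "humans_flourishing": {"humans_alive", "humans_autonomous", "needs_met"},
    "paperclips_maximized": {"factories_running", "metal_consumed"},
})
goal = "humans_flourishing"  # chosen by pointing at learned knowledge, not trained in

def choose_plan(candidate_plans: dict[str, set[str]]) -> str:
    """Pick the plan whose predicted outcome best matches the selected concept."""
    return max(candidate_plans, key=lambda p: wm.concept_score(goal, candidate_plans[p]))

plans = {
    "build_clinics": {"humans_alive", "needs_met"},
    "convert_biosphere_to_factories": {"factories_running", "metal_consumed"},
}
print(choose_plan(plans))  # -> build_clinics
```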
Yeah, specifying goals in a learned ontology does seem much better to me than behavioral training. But there are a couple of major roadblocks that come to mind:
1. You need really insanely good interpretability on the learned ontology.
2. You need to be so good at specifying goals in that ontology that they are robust to adversarial optimization.
Work on these problems is great. I particularly like John's work on natural latent variables, which seems like the sort of thing that might be useful for the first two of these.
Keep in mind, though, that there are other major problems this approach doesn't help much with, e.g.:
Standard problems arising from the ontology changing over time or being optimized against (there's a toy illustration of the drift issue just after this list).
The problem of ensuring that no subpart of your agent is pursuing different goals (or applying optimization in a way that may break the overall system at some point).
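On the ontology-drift point, here's a minimal toy illustration of how a goal specified as a fixed pointer into the model's concepts can silently change meaning as the concepts shift. All the concept names and feature sets are invented for the example.

```python
# Toy sketch of the ontology-drift problem: the goal spec is fixed, but the
# concepts it points into keep changing as the model learns. Illustrative only.

def nearest_concept(goal_features: set[str], concepts: dict[str, set[str]]) -> str:
    """Find the concept in the current ontology closest to the original goal spec."""
    def jaccard(name: str) -> float:
        c = concepts[name]
        return len(c & goal_features) / len(c | goal_features)
    return max(concepts, key=jaccard)

# The goal was pinned down against the ontology the model had at alignment time.
goal_features = {"humans_alive", "humans_autonomous", "needs_met"}

ontology_v1 = {
    "humans_flourishing": {"humans_alive", "humans_autonomous", "needs_met"},
    "industrial_output": {"factories_running", "metal_consumed"},
}
# After further learning, the model carves the world up differently: no concept
# matches the original goal spec exactly, so the pointer lands somewhere else.
ontology_v2 = {
    "biological_substrate_maintained": {"humans_alive", "bodies_preserved"},
    "preference_satisfaction_reported": {"surveys_positive", "needs_met"},
    "industrial_output": {"factories_running", "metal_consumed"},
}

print(nearest_concept(goal_features, ontology_v1))  # humans_flourishing
print(nearest_concept(goal_features, ontology_v2))  # a subtly different concept
```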
I largely agree. I think you don't need any of those things to have a shot, but you do need them to be certain.
To your point 1, I think you can reduce the need for very precise interpretability if you make the alignment target simpler. I wrote about this a little here but there’s a lot more to be said and analyzed. That might help with RL techniques too.
If you believe in natural latent variables, as I tend to, those should help with the stability problem you mention.
WRT subagents having different goals, you do need to design the system so the primary goals are dominant, which would be tricky to be certain of. I'd hope a self-aware and introspective agent could help enforce that.
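A minimal toy sketch of what "primary goals are dominant" could look like structurally: subagents propose actions, and a top-level check against the primary goal vets every proposal before execution. The names and the check itself are placeholders; making that check robust is exactly the tricky part.

```python
# Primary goal stays dominant: subagents can propose actions, but a top-level
# check against the primary goal vets every proposal before execution.
def primary_goal_endorses(action: str) -> bool:
    """Stand-in for evaluating an action against the selected primary goal.
    In a real system this is the interpretability-heavy, hard-to-make-robust step."""
    forbidden = {"seize_resources", "disable_oversight"}
    return action not in forbidden

# Each subagent stands in for a subprocess that may have drifted toward its
# own objective and proposes actions accordingly.
subagent_proposals = {
    "logistics": ["build_clinics", "ship_supplies"],
    "resource_acquisition": ["ship_supplies", "seize_resources"],
}

approved = [
    action
    for proposals in subagent_proposals.values()
    for action in proposals
    if primary_goal_endorses(action)
]
print(approved)  # "seize_resources" is filtered out before it can be executed
```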