Rohin Shah at CHAI also recently developed a proposal to produce a demonstration of an inner alignment failure that shares many similarities with my proposal here.
The proposal is here. I’m not planning to work on it soon, so if someone else would like to, please do (but let me know so we don’t duplicate work).
Similarities:
1. A focus on inner alignment failures, and particularly the definition of “capabilities generalize, objective doesn’t”
2. Identification of “many possible proxy rewards” as the key problem, and agreement that it isn’t particularly interesting or surprising to construct environments where multiple reward functions are identical on the training distribution (see the toy sketch after this list)
3. The focus on two components: environment design, and an architecture that can lead to capability generalization
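To make the “many possible proxy rewards” point concrete, here is a minimal toy sketch, not taken from either proposal: two reward functions that are identical on the training distribution (where the arrow never moves) but come apart once the arrow is placed elsewhere. The specific maze setup, positions, and function names are assumptions for illustration only.

```python
# Hypothetical illustration (not from either proposal): two reward functions
# that agree on every training episode but disagree off-distribution.

# Assumed training setup: the arrow always sits in the top-right corner,
# so "reach the arrow" and "reach the top-right corner" give identical
# rewards on all training trajectories.

TRAIN_ARROW_POS = (9, 9)  # arrow is fixed here during training (assumption)

def intended_reward(agent_pos, arrow_pos):
    """Intended objective: reach the arrow, wherever it is."""
    return 1.0 if agent_pos == arrow_pos else 0.0

def proxy_reward(agent_pos, arrow_pos):
    """Proxy objective: reach the top-right corner, ignoring the arrow."""
    return 1.0 if agent_pos == TRAIN_ARROW_POS else 0.0

# On the training distribution (arrow fixed) the two rewards are identical...
for pos in [(0, 0), (5, 3), (9, 9)]:
    assert intended_reward(pos, TRAIN_ARROW_POS) == proxy_reward(pos, TRAIN_ARROW_POS)

# ...but at test time, once the arrow moves, they disagree.
assert intended_reward((2, 2), (2, 2)) != proxy_reward((2, 2), (2, 2))
```

Nothing in the training data distinguishes the two objectives; which one the trained policy ends up pursuing is exactly the inner alignment question.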
Differences:
1. I am less optimistic that existing architectures will learn general capabilities on simple environments (though they might do so on complex ones if you have enough data); I think you’ll need something like SATNet (see the proposal for details). This could be a difference in what experiments we’re imagining: for the maze example with an RNN, if the only thing you change at test time is where the arrow is, I think it is more likely than not that we see either intended generalization or capability generalization with the wrong objective. However, if you make the maze larger, then I think it is more likely than not that you get a complete generalization failure (though this depends on how exactly you encode the maze for the RNN to consume). See the sketch after this list for what these two interventions could look like.
2. I care a decent amount about producing an example that is convincing to ML researchers, and so I want a much more realistic environment, like Minecraft.
3. I don’t agree that inner alignment failures need a “model which is doing some sort of internal search such that it actually has an objective that can fail to generalize”. It’s possible to learn heuristics that don’t look like search but whose capabilities nonetheless generalize well. However, if you want, you can take the intentional stance towards the model in order to infer a behavioral objective.
4. I am less interested in figuring out which behavioral objectives neural nets tend to learn: there are so many potential behavioral objectives that I think it is basically inevitable that regular training leads to the wrong one, and it doesn’t seem that helpful to know which wrong ones it leads to. You could hope that this would help us “look into” a neural net to find the inner objective, but per difference 3 I don’t expect this to work. (I’m more optimistic about looking for particular scenarios where the model optimizes for the wrong thing.) This is not to say I think such an empirical investigation would be useless; it is the sort of thing that is likely to produce insights we didn’t see coming. It just seems less useful than other research one could be doing.
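As a companion to difference 1, here is a rough sketch of the two test-time interventions described there: moving only the arrow versus also enlarging the maze. The configuration class, sizes, and sampler names are hypothetical, invented for illustration; neither proposal specifies this interface.

```python
# Hypothetical sketch of the two test-time interventions from difference 1.
# The MazeConfig class and sampler names are invented for illustration.

from dataclasses import dataclass
import random

@dataclass
class MazeConfig:
    size: int         # maze is size x size
    arrow_pos: tuple  # location of the arrow (the intended goal marker)

def sample_train_config():
    # Training: small maze, arrow always in the same corner, so "go to the
    # arrow" and "go to that corner" are indistinguishable (assumed setup).
    return MazeConfig(size=10, arrow_pos=(9, 9))

def sample_test_config_move_arrow(rng):
    # Intervention A: same maze size, only the arrow moves. Here one might
    # see either intended generalization or capability generalization with
    # the wrong objective.
    return MazeConfig(size=10, arrow_pos=(rng.randrange(10), rng.randrange(10)))

def sample_test_config_larger_maze(rng):
    # Intervention B: larger maze. A complete generalization failure seems
    # more likely here, depending on how the maze is encoded for the RNN.
    return MazeConfig(size=20, arrow_pos=(rng.randrange(20), rng.randrange(20)))

rng = random.Random(0)
print(sample_train_config())
print(sample_test_config_move_arrow(rng))
print(sample_test_config_larger_maze(rng))
```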
Side question: How can I easily make properly-formatted numbered lists?