This is really exciting to me. If this work generalizes to future models capable of sophisticated planning in the real world, we will be able to forecast the future actions that internally justify an AI’s current actions, and thus tell whether it is planning a coup, whether or not an explicit general-purpose representation of the objective exists.
Good point: explicit representations of the objective might not be as crucial for safety applications as my post frames them.
That said, some reasons this might not generalize in a way that enables this kind of application:
I think this type of look-ahead/search is especially favored in chess, and it might not be as important in at least some domains in which we’d want to understand the model’s cognition.
Our results are on a very narrow subset of board states (“tactically complex” ones). We already start with a filtered set of “puzzles” instead of general states, and then use only 2.5% of those. Anecdotally, the mechanisms we found are much less prevalent in random states.
I do think there’s an argument that these “tactically complex” states are the most interesting ones. But on the other hand, a lot of Leela’s playing strength comes from making very good decisions in “normal” states, which accumulate over the course of a game.
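For concreteness, here is a minimal sketch of the kind of filtering step involved in selecting tactically sharp positions. This is not our actual criterion, just one plausible proxy (the evaluation gap between the best and second-best move according to a stronger reference engine); the engine path, search depth, and threshold are placeholders.

```python
import chess
import chess.engine

# Hypothetical proxy for "tactical complexity": keep positions where the best
# move is much better than the second-best move according to a stronger
# reference engine. Engine path, depth, and threshold are all placeholders.
def is_tactically_complex(fen, engine, gap_threshold_cp=200, depth=20):
    board = chess.Board(fen)
    infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    if len(infos) < 2:  # only one legal move: not interesting
        return False
    best = infos[0]["score"].relative.score(mate_score=100_000)
    second = infos[1]["score"].relative.score(mate_score=100_000)
    return best - second >= gap_threshold_cp

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # placeholder path
puzzle_fens = [chess.STARTING_FEN]  # stand-in; real puzzle FENs would go here
complex_fens = [fen for fen in puzzle_fens if is_tactically_complex(fen, engine)]
engine.quit()
```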
Chess has an extremely simple “world model” with clearly defined states and actions. And we know exactly what that world model is, so it’s easy-ish to look for relevant representations inside the network. I’d expect everything to be much messier for networks using models of the real world.
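To illustrate why a fully specified world model helps: states and legal actions can be enumerated directly from the rules, which makes it easy to generate ground-truth features (piece locations, legal target squares, and so on) to probe for. A minimal sketch using python-chess; the 12-plane encoding below is a generic one, not Leela’s exact input format.

```python
import chess
import numpy as np

# Generic 12-plane piece encoding (one 8x8 plane per piece type and color).
# This is illustrative only, not Leela's actual input representation.
def encode_board(board: chess.Board) -> np.ndarray:
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for square, piece in board.piece_map().items():
        plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        planes[plane, chess.square_rank(square), chess.square_file(square)] = 1.0
    return planes

board = chess.Board()
state = encode_board(board)        # fully specified state
actions = list(board.legal_moves)  # fully specified action set
```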
We have ground truth for the “correct” reason for any given move (using chess engines much stronger than the Leela network by itself). And in fact, we try to create an input distribution where we have reason to believe that we know what future line Leela is considering; then we train probes on this dataset (among other techniques). In a realistic scenario, we might not have any examples where we know for sure why the AI took an action.
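As a rough illustration of the probing setup (a sketch, not our actual code): given cached activations from some layer and a label for the future move we believe Leela is considering, e.g. the target square of a later move in the line, a linear probe is essentially multiclass logistic regression on those activations. The array names and shapes below are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder data: activations from one layer, flattened per position,
# and the target square (0-63) of the future move we think the network
# is "considering". Real labels would come from the constructed dataset.
activations = torch.randn(5000, 768)            # (num_positions, d_model)
target_squares = torch.randint(0, 64, (5000,))  # placeholder labels

probe = nn.Linear(activations.shape[1], 64)  # linear probe over 64 squares
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(probe(activations), target_squares)
    loss.backward()
    opt.step()

accuracy = (probe(activations).argmax(dim=-1) == target_squares).float().mean()
```

In practice you would of course train on one set of positions and measure accuracy on held-out ones; the point is just that once you have labels you trust, the probing step itself is simple.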
I don’t think our understanding of Leela is good enough to enable these kinds of applications. For example, pretend we were trying to figure out whether Leela is really “trying” to win at chess, or whether it’s actually pursuing some other objective that happens to correlate pretty well with winning. (This admittedly isn’t a perfect analogy for planning a coup.) I don’t think our results so far would have told us.
I’m reasonably optimistic that we could get there in the specific case of Leela, though, with a lot of additional work.