This was a great read. Thanks in particular for sharing some introspection on motivation and thinking processes leading to these findings!
Two thoughts:
First, I sense that you’re somewhat dissatisfied with using total variation distance (‘average action probability change’) as a quantitative measure of the impact of an intervention on behaviour. In particular, it doesn’t weight ‘meaningfulness’, and important changes might get washed out by lots of small changes in unimportant cells. When we visualise, I think we intuitively do something richer, but in order to test at scale, visualisation becomes a bottleneck, so you need something quantitative like this. Perhaps you might get some mileage by considering the stationary distribution of the policy-induced Markov chain? It can be approximated by multiplying the transition matrix by itself a few times! Obviously that matrix is technically quadratic in the state count, but it’s also very sparse :) so that might be relatively tractable given that you’ve already computed a NN forward pass for each state to get to this point. Or you could eigendecompose the transition matrix.
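Concretely, a minimal sketch of the power-iteration route, assuming you’ve already assembled a row-stochastic sparse `transition` matrix from the per-state forward passes (all names here are illustrative). This iterates a distribution vector rather than squaring the full matrix, which keeps everything sparse:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def stationary_distribution(transition: sp.csr_matrix, iters: int = 200) -> np.ndarray:
    """Approximate the stationary distribution of a row-stochastic transition
    matrix P by power iteration, assuming the chain is ergodic so the limit exists."""
    n = transition.shape[0]
    dist = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(iters):
        dist = transition.T @ dist    # one chain step: dist' = dist @ P
    return dist / dist.sum()          # renormalise against numerical drift

# Eigendecomposition route: the stationary distribution is the left
# eigenvector of P with eigenvalue 1, i.e. the leading eigenvector of P^T:
# eigvals, eigvecs = spla.eigs(transition.T, k=1, which='LM')
```

Weighting the per-cell action-probability changes by this distribution would then emphasise the states the policy actually spends time in.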
Second, this seems well-informed to me, but I can’t really see the connection to (my understanding of) shard theory here, other than it being Team Shard! Maybe that’ll be clearer in a later post.
Mostly in a later post. Ultimately, shard theory makes claims about goal/value formation in agents. In particular, some shard-theory flavored claims are:

- That agents will have multiple, contextually activated goals and values
  - So: we looked for “shards”, and (I think) found them.
- That the multiple goals are each themselves made out of small pieces/circuits called “subshards” which can be separately manipulated or activated or influenced (see e.g. channels 55 and 42 having different effects when intervened upon; a hedged sketch of that kind of intervention is below)
- That it’s profitable to think of agents as having multiple contextual goals, instead of thinking of them as “optimizing for a fixed objective”
- That we can predict what goals agents will form by considering their reinforcement schedules
  - That we can predict what goals will be activated by considering what historical reinforcement events pertain to a given situation (e.g. is the cheese near the top-right corner, or not?)
- And that we should gain skill at this art, today, now, in current systems. It seems like a clear alignment win to be able to loosely predict what goals/generalization behavior will be produced by a training process.

(I would not have tried this project or its interventions if not for shard theory, and found shard theory reasoning very helpful throughout the project, and have some sense of having cut to empirical truths more quickly because of that theory. But I haven’t yet done deep credit assignment on this question. I think a more careful credit assignment will come down to looking at my preregistered predictions and reasoning.)
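For concreteness, a channel-level intervention of the sort mentioned above might look roughly like this PyTorch sketch. The layer name (`block2`) and the constant value are hypothetical placeholders for illustration, not the actual code from the post:

```python
import torch

def make_channel_intervention(channel: int, value: float):
    """Return a forward hook that overwrites one channel of a conv
    activation map (shape: batch x channels x H x W) with a constant."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel] = value  # pin the chosen channel to `value` everywhere
        return patched               # returning a tensor replaces the layer's output
    return hook

# Hypothetical usage: intervene on channel 55 of some intermediate block.
# handle = policy.block2.register_forward_hook(make_channel_intervention(55, 1.0))
# logits = policy(obs_batch)  # forward pass runs with the intervention active
# handle.remove()             # detach the hook afterwards
```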
There are probably more ties I haven’t thought of. But hopefully this gives a little context!