Good ideas! I worry that a shallow MLP wouldn’t be capable enough to see a rich signal in the direction of increasing agency, but we should certainly try to do the easy version first.
I think unless you’re extremely lucky and this turns out to be a highly human-visible thing somehow, you’d never notice what you’re looking for among all the other complicated changes happening that nobody has analysis tools or even vague definitions for yet.
I don’t think I’m seeing the complexity you’re seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs and reverting them to their original values to see that set’s qualitative influence on behavior. I don’t think this requires rigorous operationalizations.
An example: In a chess-playing context, this will lead to different moves, or to out-of-action-space behavior. The various kinds of out-of-action-space behavior, or biases in move changes, seem like they’d give us insight into what the head set was doing, even if we don’t understand the mechanisms used inside the head set.
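A minimal sketch of the revert-and-compare idea, assuming the model's parameters can be treated as a name-to-value mapping (with PyTorch, something like what `model.state_dict()` returns). The function names, the `policy` callable, and the `legal_moves` callable are all hypothetical stand-ins, and this version reverts whole named parameters; reverting a single attention head would in practice mean slicing the relevant weight matrices.

```python
from copy import deepcopy


def revert_components(finetuned, original, names):
    """Return a copy of `finetuned` with the parameters in `names`
    reset to their values in `original`. Inputs are left untouched."""
    missing = [n for n in names if n not in finetuned or n not in original]
    if missing:
        raise KeyError(f"unknown parameter names: {missing}")
    patched = deepcopy(finetuned)
    for name in names:
        patched[name] = deepcopy(original[name])
    return patched


def summarize_changes(policy, finetuned, patched, positions, legal_moves):
    """Count how the chosen move changes after reverting a component set.

    `policy(params, position)` returns a move (hypothetical interface);
    `legal_moves(position)` returns the set of in-action-space moves.
    """
    summary = {"same": 0, "changed_legal": 0, "out_of_action_space": 0}
    for pos in positions:
        before = policy(finetuned, pos)
        after = policy(patched, pos)
        if after not in legal_moves(pos):
            summary["out_of_action_space"] += 1
        elif after != before:
            summary["changed_legal"] += 1
        else:
            summary["same"] += 1
    return summary
```

Running `summarize_changes` over many positions for each candidate set gives a rough tally of which sets mostly shift move choice and which push the model out of the action space entirely. It says nothing about the mechanism inside the set, which is the point of the method as described.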
I don’t think I’m seeing the complexity you’re seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs and reverting them to their original values to see that set’s qualitative influence on behavior. I don’t think this requires rigorous operationalizations.
That sounds to me like it would give you a very rough, microscope-level view of all the individual things the training is changing around. I am sceptical that by looking at this ground-level data, you’d be able to separate out the things-that-are-agency from everything else that’s happening.
As an analogy, looking at what happens if you change the wave functions of particular clumps of silicon atoms doesn’t help you much in divining how the IBM 608 divides numbers if you haven’t even worked out yet that the atoms in the machine are clustered into things like transistors and cables. And actually, you don’t even really know how dividing numbers works on a piece of paper; you just think of division as “the inverse of multiplication”.