I expect looking at the internals of trained neural networks will give lots of feedback about what the natural data structures are.
Okay, really rough idea on how to identify where an ML model’s goals are stored + measure how much of an optimizer it is. If successful, it might provide a decent starting point for disentangling concepts from each other.
The Ground of Optimization mentions “retargetability” as one of the variables of optimizing systems. How much of the system do you need to change in order to make it optimize towards a different target configuration? Can you easily split the system into the optimizer and the optimized? For example: In a house-plus-construction-company system, we just need to vary the house’s schematics to make the system optimize towards wildly different houses. Conversely, to make a ball placed at the edge of a giant inverted cone come to rest in a different location, we’d need to change the shape of the entire cone.
Intuitively, it seems like it should be possible to identify goals in neural networks the same way. A “goal” is the minimal set of parameters that you need to perturb in order to make the network optimize a meaningfully different metric without any loss of capability.
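Spelled out a bit more formally (my formalization; the wording above is deliberately informal, and “objective” and “capability” are placeholders for whatever behavioral metrics you’d actually measure):

```latex
\theta_{\text{goal}}
  \;=\; \arg\min_{S \,\subseteq\, \text{params}} |S|
  \quad\text{s.t.}\quad
  \exists\, \delta \text{ supported on } S:\;
  \text{objective}(\theta + \delta) \neq \text{objective}(\theta)
  \;\;\text{and}\;\;
  \text{capability}(\theta + \delta) \approx \text{capability}(\theta).
```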
Various shallow pattern-matchers/look-up tables are not easily retargetable — you’d need to rewrite most of their parameters. They’re more like inverted cones.
Idealized mesa-optimizers with a centralized crystallized mesa-objective are very retargetable — their utility function is precisely mathematically defined, disentangled from capabilities, and straightforwardly rewritten.
Intermediate systems, e.g. shard economies or heuristics over world-models, are somewhat retargetable. There may be only limited dimensions along which their mesa-objectives can be changed without capability loss, limited “angles” in concept-space by which their targeting can be adjusted. Alternatively/additionally, you’d need to rewrite the entire suite of shards/heuristics at once, in a cross-dependent manner.
As a bonus, the fraction of parameters you need to change to retarget the system roughly tells you how much of an optimizer it is.
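Here’s a minimal sketch (in PyTorch, my framing rather than anything from the post) of what a brute-force version of this could look like. `capability_loss`, `new_target_loss`, and `candidate_subsets` are all hypothetical placeholders for however you’d operationalize “capability”, “meaningfully different metric”, and the search space over parameter subsets; the returned fraction doubles as the crude optimizer-ness score:

```python
import copy
import torch

def smallest_retargeting_fraction(model, capability_loss, new_target_loss,
                                  candidate_subsets, steps=100, lr=1e-3, tol=0.05):
    """Brute-force operationalization of the 'goal' definition above.

    capability_loss(model) and new_target_loss(model) are assumed to return
    scalar tensors; candidate_subsets is a list of sets of parameter names to
    try (the hard part -- enumerating all subsets is obviously intractable).
    For each candidate subset, fine-tune only those parameters toward the new
    target; accept the subset if capability stays within `tol` of baseline.
    Returns the fraction of parameters in the smallest accepted subset."""
    base_capability = capability_loss(model).item()  # assumes a nonnegative loss
    n_total = sum(p.numel() for p in model.parameters())

    for names in sorted(candidate_subsets, key=len):  # try the smallest subsets first
        candidate = copy.deepcopy(model)
        # freeze everything outside the candidate 'goal' subset
        for name, p in candidate.named_parameters():
            p.requires_grad_(name in names)
        trainable = [p for p in candidate.parameters() if p.requires_grad]
        opt = torch.optim.Adam(trainable, lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            new_target_loss(candidate).backward()
            opt.step()
        if capability_loss(candidate).item() <= base_capability * (1 + tol):
            n_changed = sum(p.numel() for n, p in candidate.named_parameters() if n in names)
            return n_changed / n_total
    return None  # no candidate subset managed to retarget the model
```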
The question is how to implement this. It’s easy to imagine algorithms that would work given infinite compute, but what about in practice?
Neuron Shapleys may be a good starting point? The linked paper seems to “use the Shapley value framework to measure the importance of different neurons in determining an arbitrary metric of the neural net output”, and the authors use it to tank accuracy/remove social bias/increase robustness to adversarial attacks just by rewriting a few neurons. It might be possible to do something similar to detect goal-encoding neurons? Haven’t looked into it in-depth yet, though.
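For concreteness, here’s a heavily simplified Monte Carlo version of the idea: plain permutation sampling over neurons, not the paper’s bandit-accelerated estimator. `metric`, `ablate`, and `restore` are hypothetical hooks for instrumenting the network, not part of any library:

```python
import random

def neuron_shapley_estimates(model, metric, neurons, ablate, restore, n_samples=50):
    """Simplified Monte Carlo Shapley estimate of each neuron's contribution to
    metric(model) -- e.g. a proxy score for 'how strongly the net pursues its
    current goal'. `neurons` is a list of handles; ablate(model, h) zeroes that
    neuron and returns an undo token; restore(model, token) undoes it."""
    values = {h: 0.0 for h in neurons}
    for _ in range(n_samples):
        order = neurons[:]
        random.shuffle(order)               # one random permutation per sample
        prev = metric(model)                # metric with nothing ablated yet
        tokens = []
        for h in order:
            tokens.append(ablate(model, h))
            cur = metric(model)
            values[h] += (prev - cur) / n_samples  # h's marginal contribution
            prev = cur
        for t in reversed(tokens):          # restore the network for the next pass
            restore(model, t)
    return values  # high values ~ neurons the goal-metric leans on most
```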
Neat idea. One thing I’d watch out for is that “subset of the neurons” might not be the right ontology for a conceptually “small” change. E.g. in the ROME paper, they made low-rank updates rather than working with individual neurons. So bear in mind that figuring out the ontology through which to view the network’s internals may itself be part of the problem.
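To illustrate the ontology point: a ROME-style edit is rank-one, so it touches every entry of a weight matrix while having very few degrees of freedom. A toy sketch (schematic only; the real ROME update also constrains the edit so activations for unrelated keys are approximately preserved):

```python
import torch

def rank_one_edit(W, key, target_value):
    """Toy ROME-style edit: return W' with W' @ key == target_value via a
    rank-one (outer-product) update. Every entry of W moves, yet the edit has
    only ~(d_in + d_out) degrees of freedom -- 'small' in a different ontology
    than a small subset of neurons."""
    residual = target_value - W @ key                   # what W currently gets wrong on this key
    delta = torch.outer(residual, key) / key.dot(key)   # rank-one correction
    return W + delta

# usage sketch
W = torch.randn(8, 16)
k = torch.randn(16)
v_star = torch.randn(8)
W_edited = rank_one_edit(W, k, v_star)
assert torch.allclose(W_edited @ k, v_star, atol=1e-4)
```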