There is also the “lottery ticket hypothesis” to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.
This is a fascinating point. I’m curious now how bad things can get if your lottery tickets have side channels but aren’t deceptive. It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it’s simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly. This seems likely to depend on how powerful your base optimization process is and how easy it is to influence the world through side-channels. If it’s the case that you need deception, then this probably isn’t any worse than the gradient hacking problem (though possibly it gives us more insight into how gradient hacking might work)—but if it can happen without deception, then this sort of evolving-to-extinction behavior could be a serious problem in its own right.
Suppose I have some active learning setup, where I decide which new points to investigate based on expected uncertainty reduction, or expected update to the model weights, or something like that. Then the internals of the model could be an example of these diagnostic prediction logs being relevant without humans having to look at them. There might then be competition among subnetworks to have the new training examples be places where they’ll do particularly well, or to somehow avoid areas where they’ll do poorly.
I have a hard time turning this into a story where it’s a bug instead of a feature, though; in order for a subnetwork to do particularly well, it has to know something about the real data-generating distribution that the rest of the model doesn’t. This only looks pathological if the thing it knows is somehow manufactured by the model itself. (Like, if I can write a fictional story and then win trivia contests about my own fictional story, I can hack points.)
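To make the setup concrete, here’s a minimal sketch of the kind of active-learning loop I have in mind (all names and numbers hypothetical, not a claim about any real system): an ensemble of small models stands in for competing subnetworks, and their disagreement is the acquisition score, so the models’ own outputs decide which points get labeled next and thereby shape the future training distribution.

```python
import numpy as np

# Minimal active-learning sketch (hypothetical setup, just to make the story concrete).
# An ensemble of small models stands in for competing subnetworks; their disagreement
# is the acquisition score, so the models' own outputs decide which points are labeled
# next and thereby shape the future training distribution.

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(3 * x)

pool = rng.uniform(-2, 2, size=200)      # unlabeled pool of candidate points
labeled_x = rng.uniform(-2, 2, size=10)  # small initial labeled set
labeled_y = true_fn(labeled_x)

def fit_ensemble(x, y, n_members=5, degree=3):
    """Fit polynomial models on bootstrap resamples (stand-ins for subnetworks)."""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(x), size=len(x))
        members.append(np.polyfit(x[idx], y[idx], degree))
    return members

def disagreement(members, x):
    """Std. dev. of member predictions: a crude proxy for expected uncertainty reduction."""
    preds = np.stack([np.polyval(m, x) for m in members])
    return preds.std(axis=0)

for step in range(20):
    members = fit_ensemble(labeled_x, labeled_y)
    pick = int(np.argmax(disagreement(members, pool)))  # query the most-contested point
    labeled_x = np.append(labeled_x, pool[pick])
    labeled_y = np.append(labeled_y, true_fn(pool[pick]))
    pool = np.delete(pool, pick)
```

In a loop like this, a subnetwork that could bias the disagreement signal would also bias which data the model sees next, which is where the competition story would bite.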
It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it’s simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly.
I’m not sure about the latter. Suppose there is a “simple” ticket that randomly writes stuff to the logs in a way that makes future training examples harder to predict. I don’t see what would cause that ticket to be selected for.
If that ticket is better at predicting the random stuff it’s writing to the logs—which it should be, since it’s generating that randomness—then that would be sufficient. However, that does rely on the logs directly being part of the prediction target, rather than only influencing it through some complicated function like a human seeing them.
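As a sanity check on that mechanism, here’s a toy simulation (entirely hypothetical numbers and setup): the prediction target is the XOR of a “world” bit and a log bit, one ticket writes the log bit from its own seed and so predicts it perfectly, the other only models the world, and a multiplicative-weights update on prediction loss hands essentially all the weight to the log-writer.

```python
import numpy as np

# Toy model of the selection story above (hypothetical, not a claim about any real
# training run). The prediction target is world_bit XOR log_bit. The "writer" ticket
# generates log_bit from its own seed, so it reproduces and predicts that part exactly;
# the "honest" ticket only models the world and effectively has to guess the log bit.
# Multiplicative-weights selection on prediction loss then favors the writer, even
# though the writer has made the overall data stream harder to predict.

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])               # [honest ticket, log-writing ticket]
writer_rng = np.random.default_rng(42)       # the writer's private randomness
writer_rng_copy = np.random.default_rng(42)  # same seed: the writer "knows" its own log

lr = 0.5
for t in range(200):
    world_bit = rng.integers(0, 2)
    log_bit = writer_rng.integers(0, 2)      # what the writer puts into the logs
    target = world_bit ^ log_bit             # the logs are part of the prediction target

    honest_pred = world_bit                              # ignores the log, wrong ~half the time
    writer_pred = world_bit ^ writer_rng_copy.integers(0, 2)  # reproduces its own log, loss 0

    losses = np.array([honest_pred != target, writer_pred != target], dtype=float)
    weights *= np.exp(-lr * losses)          # multiplicative-weights update on loss
    weights /= weights.sum()

print(weights)  # the log-writing ticket ends up with almost all of the weight
```

If the target only depended on the logs through something like a human reading them, the writer would get no such edge, which is exactly the caveat in the last sentence above.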