I’m curious if you’ve thought of this, but I think this doesn’t generalize to models whose outputs are too hard to evaluate?
Consider a different model we might train in the future that is trained on all the internal data in a company — all the emails, slack, stock price, trades, spreadsheets, etc, and is trained to predict the stock movement, and then is used to generate the rest of the emails from the CEO and so forth that it predicts will increase the stock price further. It is then given the actual role of CEO and its instructions are followed.
Suppose that you perform your trick of randomly fiddling with the weights, and find that the stock performs substantially better when you do so. You inquire with the model as to why, and it tells you that it is taking a strategy with short-term payoffs but too much long-term risk, and your tampering is causing it to make worse decisions. If you ask it to explain why its current strategy is better, it says it’s for complicated reasons that it cannot well-explain to a mere IQ 150 human. Is it so clear now whether the model is intentionally pretending to be less competent than it is, or not?
[Meta: I will probably come back and address some of your other points tomorrow, but in 2 minutes my laptop is going to transform itself into a rather expensive paperweight.]
Yes, the sandbagging detection is quite limited. I anticipate it’s mostly useful as an “oh no, something weird is going on” alarm. I don’t think it will be particularly helpful in figuring out what exactly.
I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.
Since this is intended as an “anomaly-detecting in-the-lab test, for use on well understood benchmarks”, I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly compromise the model’s capabilites down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.
In theory , this would work up until the model was so capable that it was able to do something like gradient hacking to engineer its expected impairment curve to avoid betraying itself. At that point of capabilities I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can’t anticipate the need to be robust to noise injection.
That’s a clever trick!
I’m curious if you’ve thought of this, but I think this doesn’t generalize to models whose outputs are too hard to evaluate?
Consider a different model we might train in the future that is trained on all the internal data in a company — all the emails, slack, stock price, trades, spreadsheets, etc, and is trained to predict the stock movement, and then is used to generate the rest of the emails from the CEO and so forth that it predicts will increase the stock price further. It is then given the actual role of CEO and its instructions are followed.
Suppose that you perform your trick of randomly fiddling with the weights, and find that the stock performs substantially better when you do so. You inquire with the model as to why, and it tells you that it is taking a strategy with short-term payoffs but too much long-term risk, and your tampering is causing it to make worse decisions. If you ask it to explain why its current strategy is better, it says it’s for complicated reasons that it cannot well-explain to a mere IQ 150 human. Is it so clear now whether the model is intentionally pretending to be less competent than it is, or not?
[Meta: I will probably come back and address some of your other points tomorrow, but in 2 minutes my laptop is going to transform itself into a rather expensive paperweight.]
Yes, the sandbagging detection is quite limited. I anticipate it’s mostly useful as an “oh no, something weird is going on” alarm. I don’t think it will be particularly helpful in figuring out what exactly. I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.
Since this is intended as an “anomaly-detecting in-the-lab test, for use on well understood benchmarks”, I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly compromise the model’s capabilites down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.
In theory , this would work up until the model was so capable that it was able to do something like gradient hacking to engineer its expected impairment curve to avoid betraying itself. At that point of capabilities I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can’t anticipate the need to be robust to noise injection.