Note that not all capabilities / tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.
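To illustrate what "maximizing an objective measure" means here, a toy sketch (entirely my construction, not from this discussion): the score below is computed mechanically by a checkable predicate, with no human judge in the loop, so there is nothing for the optimizer to deceive. In the vulnerability-finding case, the analogous predicate might be "the exploit reproduces" or "the test suite passes".

```python
import random

random.seed(0)

TARGET_LEN = 12

def objective_score(candidate: str) -> int:
    """Count positions where the candidate satisfies the spec
    (here: an alternating 'ab' pattern). Mechanically checkable,
    no subjective human rating involved."""
    return sum(1 for i, c in enumerate(candidate) if c == "ab"[i % 2])

def hill_climb(steps: int = 2000) -> str:
    """Crude stand-in for fine-tuning: keep mutations that do not
    lower the objective score."""
    cand = "".join(random.choice("ab") for _ in range(TARGET_LEN))
    for _ in range(steps):
        i = random.randrange(TARGET_LEN)
        mutated = cand[:i] + random.choice("ab") + cand[i + 1:]
        if objective_score(mutated) >= objective_score(cand):
            cand = mutated
    return cand

best = hill_climb()
print(best, objective_score(best))
```

The optimizer converges on the spec because the reward is fully objective; the deception concern arises precisely when the score is instead "did a human approve of this output".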
Regarding “escapes”, the traditional fear was that because AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is more significant than a human's. Think of trying to get superhuman scientific capabilities by doing something like simulating a collection of 1,000 scientists using a model with 100T or so parameters. Even if you already have the pre-trained weights, just running the model requires highly non-trivial computing infrastructure (which may be possible to track and detect). So it might be easier for a human to escape a prison and live undetected than for a superhuman AI to “escape”.
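To make the footprint point concrete, a back-of-envelope calculation (my arithmetic; the 16-bit precision and 80 GB-per-accelerator figures are illustrative assumptions, not from the comment above):

```python
# Memory footprint of merely *holding* a 100T-parameter model's weights.
params = 100e12            # 100 trillion parameters, per the example above
bytes_per_param = 2        # 16-bit (fp16/bf16) weights -- an assumption

weight_bytes = params * bytes_per_param          # 2.0e14 bytes = 200 TB

gpu_mem_bytes = 80e9       # one 80 GB accelerator (assumed H100-class)
gpus_for_weights = weight_bytes / gpu_mem_bytes  # GPUs just to hold weights

print(f"{weight_bytes / 1e12:.0f} TB of weights; "
      f">= {gpus_for_weights:.0f} accelerators before any activation "
      f"memory, KV caches, or redundancy")
```

Even this lower bound, hundreds of terabytes across thousands of tightly interconnected accelerators, is the kind of infrastructure that is hard to acquire or run unnoticed, which is the sense in which the footprint may be trackable.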
I think training exclusively on objective measures has a couple of other issues:
For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier “approval” measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).
I think your point about the footprint is a good one and means we could potentially be very well-placed to track “escaped” AIs if a big effort were put in to do so. But I don’t see signs of that effort today and don’t feel at all confident that it will happen in time to stop an “escape.”