I actually agree! As I wrote in my post, “GPT is not an agent, [but] it can “play one on TV” if asked to do so in its prompt.” So yes, you wouldn’t need a lot of scaffolding to adapt a goal-less pretrained model (what I call an “intelligence forklift”) into an agent that does very sophisticated things.
However, this separation into two components (the super-intelligent but goal-less “brain”, and the simple “will” that turns it into an agent) can have safety implications. For starters, as long as you haven’t added any scaffolding, you are still OK. So during most of the time you spend training, you don’t need to worry about the system itself developing goals. (Though you could still worry about hackers.) Once you start adapting it, you do need to start worrying about this.
The other thing is that, as I wrote there, it does change some of the safety picture. The traditional view of a super-intelligent AI is of the “brains and agency” tightly coupled together, just like they are in a human. For example, suppose a human is super-good at finding vulnerabilities and breaking into systems. They also have the capability to help fix systems, but I can’t just take their brain and fine-tune it on this task. I have to convince them to do it.
However, things change if we don’t think of the agent’s “brain” as belonging to them, but rather as some resource that they are using. (Just like if I use a forklift to lift something heavy.) In particular, it means that capabilities and intentions might not be tightly coupled: there could be agents using capabilities to do very bad things, but the same capabilities could be used by other agents to do good things.
I agree with these points! But:
Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when reinforcement is vulnerable to deception and manipulation.
I still don’t think this adds up to a case for being confident that there aren’t going to be “escapes” anytime soon.
Not all capabilities/tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.
Regarding “escapes”, the traditional fear was that because AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is more significant than a human’s. Think of trying to get superhuman scientific capabilities by doing something like simulating a collection of 1000 scientists using a 100T or so parameter model. Even if you already have the pre-trained weights, just running the model requires highly non-trivial computing infrastructure (which may be possible to track and detect). So it might be easier for a human to escape a prison and live undetected than for a superhuman AI to “escape”.
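To put rough numbers on the footprint point, here is a back-of-the-envelope sketch. The 100T parameter count is the figure used above; the 2 bytes per parameter (16-bit weights) and the 80 GB of memory per accelerator are my own illustrative assumptions, and the function name is made up for this example. It only counts the memory needed to hold the weights, so it is a lower bound rather than an estimate of what running the model would actually take.

```python
def weight_footprint(n_params, bytes_per_param=2, accel_mem_gb=80):
    """Return (terabytes of weights, minimum accelerators needed to hold them).

    bytes_per_param and accel_mem_gb are illustrative assumptions,
    not figures from the discussion.
    """
    total_bytes = n_params * bytes_per_param
    terabytes = total_bytes / 10**12
    accel_bytes = accel_mem_gb * 10**9
    # Integer ceiling division: accelerators needed just to store the weights.
    accelerators = -(-total_bytes // accel_bytes)
    return terabytes, accelerators


# The "100T or so parameter model" mentioned above.
tb, n_accel = weight_footprint(100 * 10**12)
print(f"{tb:.0f} TB of weights, at least {n_accel} accelerators to hold them")
```

Under these assumptions the weights alone come to about 200 TB, spread over at least 2500 accelerators, before counting activations, networking, or power. That is the sense in which the footprint might be hard to hide.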
I think training exclusively on objective measures has a couple of other issues:
For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier “approval” measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).
I think your point about the footprint is a good one and means we could potentially be very well-placed to track “escaped” AIs if a big effort were put in to do so. But I don’t see signs of that effort today and don’t feel at all confident that it will happen in time to stop an “escape.”