I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.
We don’t need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly “Yes”.
(Note that you can trivially claim the problem here isn’t being solved because we haven’t solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that’s not very common when theorizing about these matters. I’m frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I’m pointing at here. That said, I don’t think people should get credit for failing to make any predictions, and as a consequence, failing to get proven wrong.
To the extent their predictions were proven correct, we should give them credit. But to the extent they made no predictions, it’s hard to see why that vindicates them. And regardless of any predictions they may or may not have made, it’s still useful to point out that we seem to be making progress on several problems that people pointed out at the time.
Great, let’s talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that’s what you had said. E.g. suppose you had said “Hey, why don’t we just prompt AutoGPT-5 with lots of corrigibility instructions?” then we could have a more technical conversation about whether or not that’ll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
I don’t think current systems are well described as having “big picture awareness”. From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud.
I’m not certain this was your claim, but it seems to have been.
it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud
Wouldn’t reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
My claim was not that current LLMs have a high level of big picture awareness.
Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness.
And I’d predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system.
And if you think that wouldn’t count as an adequate solution to the problem, then it’s not clear the problem was coherent as written in the first place.