You think I take the original argument to be arguing from ‘has goals’ to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from “weakly has goals” to “strongly has goals”). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the “intelligent” --> “weakly has goals” step as a relatively weak step in our current arguments. (In my original post, my main point was that that step doesn’t follow from pure math / logic.)
In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through.
Yes, that’s basically right.
At least, the argument makes sense. I don’t know how strong its effect is—basically I agree with your phrasing here:
This force probably doesn’t exist out at the zero goal directedness edges, but it is unclear how strong it is in the rest of the space—i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.
I wrote an AI Impacts page summary of the situation as I understand it. If anyone feels like looking, I’m interested in corrections/suggestions (either here or in the AI Impacts feedback box).
Looks good to me :)
There’s a particular kind of cognition that considers a variety of plans, predicts their consequences, rates them according to some (simple / reasonable but not aligned) metric, and executes the one that scores highest.
That sort of cognition will consider both plans that disempower humans and then directly improve the metric and plans that improve the metric without disempowering humans, and in predicting the consequences of these plans and rating them it will assign higher ratings to the plans that do disempower humans (since disempowering humans removes obstacles to pushing the metric higher).
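To make the shape of that cognition concrete, here is a minimal sketch of the select-the-highest-rated-plan loop. The plan generator, world model, metric, and executor are all hypothetical placeholders, not a claim about how any real system is implemented:

```python
from typing import Any, Callable, Iterable

def argmax_planner(
    generate_plans: Callable[[], Iterable[Any]],  # proposes candidate plans
    predict: Callable[[Any], Any],                # world model: plan -> predicted outcome
    metric: Callable[[Any], float],               # simple (not aligned) score over outcomes
    execute: Callable[[Any], None],               # acts on the chosen plan
) -> None:
    """Consider plans, predict their consequences, rate them, execute the top-rated one."""
    best_plan, best_score = None, float("-inf")
    for plan in generate_plans():
        outcome = predict(plan)   # predictions can include "humans end up disempowered"
        score = metric(outcome)   # the metric only sees its own number, not side effects
        if score > best_score:
            best_plan, best_score = plan, score
    if best_plan is not None:
        execute(best_plan)        # whichever plan scored highest runs, harmful or not
```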
I kinda want to set aside the term “goal directed” and ask instead to what extent the AI’s cognition does the thing above (or something basically equivalent). When I’ve previously said “intelligence without goal directedness”, I think what I should have been saying is “cognitions that aren’t structured as described above, but that are nonetheless useful to us in the real world”.
(For example, an AI that predicts the consequences of plans and directly translates those consequences into human-understandable terms would be very useful while not implementing the dangerous sort of cognition above.)
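For contrast, a minimal sketch of the kind of system in the parenthetical above: it reuses the same hypothetical world model but only predicts and describes consequences, with no step that rates plans or executes the winner:

```python
from typing import Any, Callable

def consequence_reporter(
    plan: Any,
    predict: Callable[[Any], Any],   # the same hypothetical world model as in the sketch above
    describe: Callable[[Any], str],  # renders the predicted outcome in human-understandable terms
) -> str:
    """Predict a plan's consequences and translate them for a human; nothing is executed."""
    return describe(predict(plan))
```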
I certainly agree that, even amongst programs that don’t have the structure above, there are still ones that are more or less coherent. I’m mostly saying that this seems not very related to whether such a system takes over the world, and so I don’t think about it that much.
I think it can make sense to talk about an agent having coherent preferences over its internal state; whether that’s a useful abstraction depends on why you’re analyzing that agent and on the more concrete details of the setup.
For instance, say you have an oracle AI that takes compute steps to minimize incoherence in its world models. It doesn’t evaluate or compare steps before taking them, at least as understood in the “agent” paradigm. But it is intelligent in some sense, and it somehow takes the correct steps a lot of the time.
If you connect this AI to a secondary memory, it seems reasonable to predict that it will probably start writing to the new memory too, in order to have larger but still coherent models.
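One toy way to picture the kind of system being described here (the learned next-write policy and the dict-backed model store are made-up placeholders, not a proposal for how such an oracle would actually work):

```python
from typing import Callable, Dict, Tuple

def incoherence_reducing_oracle(
    model: Dict[str, str],                                    # world-model store: claim -> value
    next_write: Callable[[Dict[str, str]], Tuple[str, str]],  # e.g. a trained net emitting its next edit
    steps: int,
) -> Dict[str, str]:
    """Apply whatever write the learned policy emits next.

    There is no explicit evaluation or comparison of candidate steps; the claim under
    discussion is that such a system tends to emit coherence-improving writes anyway,
    and would keep writing into any larger store that `model` happens to be backed by.
    """
    for _ in range(steps):
        key, value = next_write(model)
        model[key] = value
    return model
```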
I don’t buy this prediction. You could fine-tune a language model today to do chain-of-thought reasoning, which sounds like “an oracle AI that takes compute steps to minimize incoherence in world models”. I predict that if you then add an additional head that can read from and write to an external memory, or allow the model to output sentences like “Write [X] to memory slot [Y]” or “Read from memory slot [Y]” that we service via an external database, it does not then start using that memory in some particularly useful way.
You could say that this is because current language models are too dumb, but I don’t particularly see why this will necessarily change in the future (besides the fact that we will probably specifically train models to use external memory). Overall I’m at “sure, maybe a sufficiently intelligent oracle AI would do this, but whether it happens seems to depend on the details of how the AI works”.
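For reference, the external-memory setup being described is roughly the following harness; the command formats and the `generate` function are placeholders for whatever the fine-tuned model would actually output:

```python
import re
from typing import Callable, Dict

def run_with_external_memory(
    generate: Callable[[str], str],  # placeholder for the fine-tuned model's next output
    prompt: str,
    turns: int,
) -> str:
    """Let the model emit 'Write [X] to memory slot [Y]' / 'Read from memory slot [Y]'
    commands, service them from a plain dict, and feed the results back into the context."""
    memory: Dict[str, str] = {}
    transcript = prompt
    for _ in range(turns):
        out = generate(transcript).strip()
        write = re.fullmatch(r"Write \[(.+)\] to memory slot \[(.+)\]", out)
        read = re.fullmatch(r"Read from memory slot \[(.+)\]", out)
        if write:
            memory[write.group(2)] = write.group(1)   # store X under slot Y
            out += "\n(stored)"
        elif read:
            out += "\n" + memory.get(read.group(1), "(empty slot)")
        transcript += "\n" + out
    return transcript
```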