I don’t see why you can’t just ask at each point in time “Which action would maximize the expected value of X?”. It seems like asking once and asking many times as new things happen in reality don’t have particularly different properties.
More detailed comment
Paul noted:
It’s pretty unclear if a system that is good at answering the question “Which action would maximize the expected amount of X?” also “wants” X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system “Which action would maximize the expected amount of Y?”, it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.
An earlier Nate comment (not in response) is:
Part of what’s going on here is that reality is large and chaotic. When you’re dealing with a large and chaotic reality, you don’t get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to “unroll” that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like “if the experiments come up this way, then I’ll follow it up with this experiment, and if instead it comes up that way, then I’ll follow it up with that experiment”, and etc. This decision tree quickly explodes in size. And even if we didn’t have a memory problem, we’d have a time problem—the thing to do in response to surprising experimental evidence is often “conceptually digest the results” and “reorganize my ontology accordingly”. If you’re trying to unroll that reasoner into a decision-tree that you can write down in advance, you’ve got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.
Reasoners are a way of compressing plans, so that you can say “do some science and digest the actual results”, instead of actually calculating in advance how you’d digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)
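To make the size argument in the quoted comment concrete, here is a toy sketch (my illustration, not Nate’s; the branching factor, horizon, and function names are all assumptions): an advance plan that must prescribe a response to every possible observation sequence grows exponentially with the horizon, whereas a reasoner run online only ever digests the single sequence of observations that actually occurs.

```python
# Toy illustration (not from the original comments): compare how much an
# "unrolled" advance plan has to cover versus what an online reasoner does.
# `branching` = possible outcomes of each observation/experiment,
# `horizon` = number of decision points; both numbers are made up.

def unrolled_plan_branches(branching: int, horizon: int) -> int:
    """Branches an advance plan must cover if it has to prescribe a follow-up
    for every possible sequence of observations up front."""
    return branching ** horizon


def online_reasoner_digestions(horizon: int) -> int:
    """Observations an online reasoner actually digests: one per timestep of
    the single trajectory that really happens."""
    return horizon


if __name__ == "__main__":
    branching, horizon = 10, 20
    print(f"advance plan must cover {unrolled_plan_branches(branching, horizon):.1e} branches")
    print(f"online reasoner digests {online_reasoner_digestions(horizon)} observations")
```

The point of the sketch is just that “compressing the plan into a reasoner” trades an exponentially large object for a process that digests observations lazily, on the one branch that actually obtains.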
But, can’t you just query the reasoner at each point for what a good action would be? And then it seems unclear whether the AI actually “wants” the long-run outcome, versus just “wanting” to give a good response, or something else entirely.
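For concreteness, this is roughly the control flow the question has in mind, sketched with hypothetical interfaces (`reasoner.best_action`, `environment.observe`, and `environment.step` are stand-ins, not any real API). The sketch shows the proposal itself, not an answer to whether the component answering the local question thereby “wants” X.

```python
# Hypothetical sketch of "query the reasoner at each point for a good action".
# Nothing here is a real library; it only illustrates the control flow.

def run_with_per_step_queries(reasoner, environment, objective_x, n_steps):
    history = []
    for _ in range(n_steps):
        observation = environment.observe()
        history.append(observation)
        # The same local question is asked at every point in time, conditioned
        # on what has actually been observed so far, instead of demanding one
        # giant advance plan.
        action = reasoner.best_action(
            question=f"Which action would maximize the expected value of {objective_x}?",
            context=history,
        )
        environment.step(action)
    return history
```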
Maybe the claim is that if you do this, it’s equivalent to just training the reasoner to pursue the long-term outcome (which will get you a reasoner that wants long-term outcomes). Or maybe it would only work if the reasoner had the ability to solve long-horizon tasks directly, which itself might imply it’s likely to want to do this. But this seems at least unclear for reasonable training schemes.
For instance, imagine you train an AI with purely process-based feedback to take actions. As in: I want to train my AI to accomplish objectives over the course of 6 months, so I have a human review the actions the AI took over a 1-hour period and rate them based on how good they seem for accomplishing the long-term objective. It seems like this feedback is likely to deviate considerably from the best way to accomplish the long-run objective, in ways which make danger less likely. In particular, it seems far less likely that the AI will ‘want’ long-term outcomes, rather than ‘wanting’ to take actions such that the human rater will think they will lead to good long-term consequences (or some other proxy ‘want’ entirely).
(Note that just because the feedback differs considerably doesn’t mean it’s way less competitive; it might be, but that will depend on more details.)
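A minimal sketch of the contrast being drawn here, under the assumption that an episode is stored as a list of short action chunks; `human_rating_of_chunk` and `measure_final_outcome` are hypothetical placeholders, not a claim about how such training is actually implemented:

```python
# Toy contrast between the two feedback regimes discussed above. Both return
# one scalar per chunk of actions; the difference is what the scalar tracks.

def outcome_based_rewards(chunks, measure_final_outcome):
    """Credit assigned only from the realized long-run (e.g. 6-month) outcome,
    spread back over every chunk in the episode."""
    final_reward = measure_final_outcome(chunks)
    return [final_reward for _ in chunks]


def process_based_rewards(chunks, human_rating_of_chunk):
    """Credit assigned per short (e.g. 1-hour) chunk, based on how good the
    chunk *looks* to a human rater for the long-run objective: a local proxy,
    not the realized outcome."""
    return [human_rating_of_chunk(chunk) for chunk in chunks]
```

The comment’s claim concerns the second function: an AI trained against per-chunk ratings is being optimized toward “the rater thinks this looks good for the objective”, which can come apart from “the long-run outcome is actually achieved”.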
It’s totally consistent to have the view ‘AIs which just aim to satisfy local measures of goodness (e.g. a human thinks this action is good) will never be able to accomplish long-run outcomes without immense performance penalties’, but I think this seems at least unclear. Further, training based mostly on long-run feedback is very expensive (even if we’re thinking about time scales more like 2 hours than 6 months, which is more plausible anyway).
More generally, it seems like we can build systems that succeed in accomplishing long-run goals without the core components doing the work actually ‘wanting’ to accomplish any long-run goal.
It seems like this is common for corporations, and we see similar dynamics for language model agents.
(Again, efficiency concerns are reasonable.)
I do not expect you to be able to give a central example of such a corporation without finding that there is in fact a “want” implemented in the members of the corporation wanting to satisfy their bosses, who in turn want to satisfy theirs, etc. Corporations are generally supervisor trees where bosses set up strong incentives, and it seems to me that this produces a significant amount of aligned wanting in the employees, though of course there’s also backpressure.
I agree that there is want, but it’s very unclear if this needs to be long-run ‘want’.
(And for danger, it seems the horizon of the ‘want’ matters a lot.)
But, can’t you just query the reasoner at each point for what a good action would be?
What I’d expect (which may or may not be similar to Nate!’s approach) is that the reasoner has prepared one plan (or a few plans). Despite being vastly intelligent, it doesn’t have the resources to scan all the world’s outcomes and compare their goodness. It can give you the results of acting on the primary (and maybe several secondary) goal(s) and perhaps the immediate results of doing nothing or other immediate stuff.
It seems to me that Nate! (as quoted above about chess) is making the very cogent (imo) point that even a highly, superhumanly competent entity acting on the real, vastly complicated world isn’t going to be an exact oracle: it isn’t going to have access to exact probabilities of things, or probabilities of probabilities of outcomes, and so forth. It will know the probabilities of some things, certainly, but for many other outcomes it can only pursue a strategy deemed good on the basis of much more indirect processes. And this is because an exact calculation of the outcomes of the world in question tends to “blow up” far beyond any computing power physically available in the foreseeable future (for chess alone, the game tree is conventionally estimated at something like 10^120 lines of play).