This could give rise to mesa-optimizers with respect to the score function.
The score function doesn’t know how to score like that. By saying “find the concept of predicted future diamond” you called on the AI’s concept. But why should that concept be so robust that even when you train your step 3 AI to a much higher intelligence than the step 1 AI, it (the concept of predicted diamond) still knows how to score behavior or mechanisms in terms of how much diamond they lead to?
Where exactly does the mesa optimizer come from, how exactly is it working? That’s just a vague boogeyman which simply doesn’t exist in this model. Vague claims of “Ooh but mesa-optimizers” are fully general counterarguments against even perfectly aligned AI (like this) - and are thus meaningless until proven otherwise.
It’s very simple and obvious in this example, because step 1 results in a functional equivalent of the minecraft code, which has a perfect crisp representation of the objects of interest (diamond tools). “Train step 3 to a much higher intelligence than step 1” is meaningless as the output of step 1 is not an agent, it’s just a functional equivalent of the minecraft code.
Step 1 results in a perfect functional sim of minecraft with a perfect diamond tool concept, and step 2 results in a perfect diamond tool counting utility function. Step 3 then results in a perfectly aligned agent (assuming no weird errors in the MuZero training). We could alternatively replace step 3 with a simple utility maximizer like AIXI, which would obviously then correctly maximize the correct aligned utility function. MuZero is a more practical approximation of that.
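A minimal sketch of those three steps, with hypothetical names throughout (WorldModel.step stands in for the learned dynamics, state.inventory for the sim’s state representation, and the brute-force search for MuZero’s planner):

```python
from itertools import product

class WorldModel:
    """Step 1: assumed to be a learned, functionally exact copy of the sim's
    transition function: (state, action) -> next state."""
    def step(self, state, action):
        raise NotImplementedError  # placeholder for the learned dynamics

def count_diamond_tools(state):
    """Step 2: the hand-coded utility function, a crisp count of diamond
    tools in the agent's inventory, read directly off the sim state."""
    return sum(1 for item in state.inventory if item.kind == "diamond_tool")

def best_plan(model, state, actions, horizon):
    """Step 3: exhaustive planning against the model, i.e. an impractical
    AIXI-style maximizer of the step-2 utility; MuZero's learned search is
    the practical approximation of this loop."""
    best_value, best = float("-inf"), None
    for candidate in product(actions, repeat=horizon):
        s = state
        for a in candidate:
            s = model.step(s, a)
        value = count_diamond_tools(s)
        if value > best_value:
            best_value, best = value, candidate
    return best
```

The only learned object in this sketch is the dynamics model; the scoring function the search maximizes is fixed, hand-written code.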
(I think I’m going to tap out because there are too many different background assumptions we’re making here, sorry; maybe I’ll come back later… E.g. the “diamond maximizer problem” is about our world, not a world that’s plausibly solvable by something that’s literally MuZero; and so you have to have a system that’s doing complex new interesting things, which aren’t comprehended by the concept you find in your step 1 AI.)
I never said diamond maximizer problem—I said “diamond tool maximizer in minecraft”.
Of course once you have an agent that robustly optimizes a goal in sim, then you can do sim-to-real transfer, which is guaranteed to work if the sim is accurate enough, and in practice that isn’t some huge theoretical problem. (The blocker on self-driving cars is not sim-to-real transfer, for example; the sims are good enough.)
The “interesting new things” that we need here are optimizations of existing concepts.
I said diamond maximizer problem, and then you responded to that talking about this other thing that turned out to be not the diamond maximizer problem.
Actually you said this:
I don’t even know how to make an agent with a clear utility function module, or anything like that. (This to my understanding is one lesson one can take from “the diamond maximizer” problem.)
So I described how to make an agent with a clear utility function to maximize diamond tools in minecraft, which is obviously related to the diamond maximizer problem and easier to understand.
If you are actually arguing that you don’t/won’t/can’t understand how to make an agent with a clear utility function module—even after my worked example, not to mention all the successful DL agents to date—unless that somehow solves the ‘diamond maximizer’ problem, then you either aren’t discussing in good faith or the inferential gap here is just too enormous and you should read more DL.
I agree that the inferential gap here is too big, as noted above; by “agent” I of course mean “the sort of agent that is competent enough to transform the world”, which implies things like “can learn new domains by its own steering”, which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean.
The agent I described has the perfect model of its environment, and in the limit of compute can construct perfect plans to optimize for diamond tool maximization. So obviously it is the sort of agent that is competent enough to transform its world—there is no other agent more competent.
Learning a new domain (like a different sim environment) would require repeating all the steps.
“which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean”
The concept of predicted diamond doesn’t understand anything, so not sure what you meant there. Perhaps what you meant is that when learning new domains by its own steering, the concept of predicted diamond will need to be relearned. Yes, of course—the steps must be repeated.
Would your point here w.r.t. utility functions be fairly summarizable as the following?
An agent that actually achieves X can be obtained by having a superintelligence that understands the world including X, and then searching for code that scores highly on the question put to the superintelligence: “How much would running this code achieve X?”
I would agree with that statement.
I think that framing is rather strange, because in the minecraft example the superintelligent diamond tool maximizer doesn’t need to understand code or human language. It simply searches for plans that maximize diamond tools.
But assuming you could ask that question through a suitable interface the SI understood—and given some reasons to trust that giving the correct answers is instrumentally rational for the SI—then yes I agree that should work.
Ok. So yeah, I agree that in the hypothetical, actually being able to ask that question to the SI is the hard part (as opposed, for example, to it being hard for the SI to answer accurately).
My framing is definitely different than yours. The statement, as I framed it, could be interesting, but it doesn’t seem to me to answer the question about utility functions. It doesn’t explain how the code that’s found actually encodes the idea of diamonds and does its thinking in a way that’s really, thoroughly aimed at making there be diamonds. It does that somehow, and the superintelligence knows how it does that. But we don’t, so we, unlike the superintelligence, can’t use that analysis to be justifiedly confident that the code will actually lead to diamonds. (We can be justifiedly confident of that by some other route, e.g. because we asked the SI.)
Sure, but at that point you have substituted trust in the code representing the idea of diamonds for trust in a SI aligned to give you the correct code.
Yeah.
Maybe a more central thing to how our views are differing is that I don’t view training signals as identical to utility functions. They’re obviously somehow related, but they have different roles in systems. So to me changing the training signal obviously will affect the trained system’s goals in some way, but it won’t be identical to the operation of writing some objective to an agent’s utility function, and the non-identicality will become very relevant for a very intelligent system.
Another thing to say, if you like the outer / inner alignment distinction:
1. Yes, if you have an agent that’s competent to predict some feature X of the world “sufficiently well”, and you’re able to extract the agent’s prediction, then you’ve made a lot of progress towards outer alignment for X; but
2. unfortunately your predictor agent is probably dangerous, if it’s able to predict X even when asked about what happens when very intelligent systems are acting, and
3. there’s still the problem of inner alignment (and in particular we haven’t clarified utility functions—the way in which the trained system chooses its thinking and its actions to be useful to achieve its goal—which we wouldn’t need if we had the predictor-agent, but that agent is unsafe).
In the real world, these domains aren’t the sort of thing where you get a perfect simulation. The differences will strongly add up when you strongly train an AI to maximize <this thing which was a good predictor of diamonds in the more restricted domain of <the domain, as viewed by the AI that was trained to predict the environment> >.
We are now far from your original objection “I don’t even know how to make an agent with a clear utility function module”.
Imperfect simulations work just fine for humans and various DL agents, so for your argument to be correct, you now need to explain how humans can still think and steer the future with imperfect world models; once you do that, you will understand how AI can as well.
We’re not far from there. There’s inferential distance here. Translating my original statement, I’d say: the closest thing to the “utility function module” in the scenario you’re describing here with MuZero is the concept of predicted diamond and the AI it’s inside of. But then you train another AI to pursue that. And I’m saying, I don’t trust that that new trained AI actually maximizes diamond; and to the point, I don’t have any clarity on how the goals of the newly trained AI sit inside it, operate inside it, direct its behavior, etc. And in particular I don’t understand it well enough to have any justified confidence it’ll robustly pursue diamond.
So to be clear, there is just one AI, built out of several components: a world model, a planning engine, and a utility function. The world model is learned, but assumed to be learned perfectly (resulting in a functional equivalent of the actual sim physics). The planning engine can also learn action/value estimators for efficiency, but that is not required. The utility function is not learned at all, and is manually coded. So the learning components here cannot possibly cause any problems.
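In sketch form (hypothetical class name; the planner stands for any search procedure over the world model, such as MCTS or the brute-force loop sketched earlier):

```python
class DiamondToolAgent:
    """The single agent described above, assembled from three components.
    Only the world model (and, optionally, value/policy estimators inside
    the planner) comes out of training; the utility function is ordinary
    hand-written code that training never touches."""
    def __init__(self, world_model, planner, utility_fn):
        self.world_model = world_model  # learned; assumed functionally exact in the sim case
        self.planner = planner          # search over the model, e.g. MCTS or brute force
        self.utility_fn = utility_fn    # manually coded, e.g. count_diamond_tools

    def act(self, state, actions, horizon):
        # Search the world model for a high-utility plan and execute its first action.
        plan = self.planner(self.world_model, state, actions, horizon, self.utility_fn)
        return plan[0]
```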
Of course that’s just in a sim.
Translating the concept to the real world, there are now 3 possible sources of ‘errors’:
1. imperfection of the learned world model
2. imperfect planning (compute bound)
3. imperfect utility function
My main claim is that the approximation error in 1 and 2 (which is inevitable) doesn’t necessarily bias toward strong optimization of the wrong utility function (and it can’t, really).
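As a sketch of how error sources 1 and 2 enter this claim, consider a compute-bounded planner (hypothetical random-shooting version): the search can be arbitrarily weak, but in this construction it still scores every candidate plan with the same fixed utility function.

```python
import random

def bounded_plan(model, state, actions, horizon, budget, utility_fn):
    """Random-shooting search with a fixed rollout budget (error source 2).
    model.step may also be an imperfect approximation of the real dynamics
    (error source 1). Both errors make the search weaker or the rollouts
    less accurate, but every candidate is scored by the same fixed
    utility_fn; nothing here rewrites the objective being scored."""
    best_value, best = float("-inf"), None
    for _ in range(budget):
        candidate = [random.choice(actions) for _ in range(horizon)]
        s = state
        for a in candidate:
            s = model.step(s, a)
        value = utility_fn(s)
        if value > best_value:
            best_value, best = value, candidate
    return best
```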
Ok, but