In a comment below, you define an optimizer as:

An optimizer is a very advanced meta-learning algorithm that can learn the rules of (effectively) any environment and perform well in it. It’s general by definition. It’s efficient because this generality allows it to use maximally efficient internal representations of its environment.
I certainly agree that we’ll build mesa-optimizers under this definition of “optimizer”. What, then, causes them to be goal-directed, i.e., what causes them to choose their actions by considering a large space of possible plans that includes “kill all the humans”, predicting the consequences of those plans in the real world, and selecting the action to take based on how the predicted consequences are rated by some metric? Or, if they may not be goal-directed according to the definition I gave there, why will they end the world?
For an optimizer to operate well in any environment, it needs some metric by which to evaluate its performance in that environment. How else would it converge towards optimal performance? How would it know to prefer, e.g., walking over random twitching? In other words, it needs to keep track of what it wants to do in any given environment.
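To make this concrete, here’s a toy sketch (my own illustration, with made-up `predict_outcome`/`rate_outcome` stand-ins, not a claim about how any real system is built) of the selection loop I mean by “goal-directed”: enumerate candidate plans, predict each plan’s consequences with a world model, and pick whichever plan’s predicted outcome the metric rates highest:

```python
# Toy sketch of the "goal-directed" selection loop described above:
# enumerate plans, predict their consequences with a world model, and
# pick the plan whose predicted outcome the metric rates highest.
# `predict_outcome` and `rate_outcome` are hypothetical stand-ins.
from typing import Callable, Iterable

def choose_plan(
    plans: Iterable[str],
    predict_outcome: Callable[[str], float],  # world model: plan -> predicted consequence
    rate_outcome: Callable[[float], float],   # metric over predicted consequences
) -> str:
    # Select the plan whose *predicted* consequence scores best under the metric.
    return max(plans, key=lambda plan: rate_outcome(predict_outcome(plan)))

# Toy world model: walking covers ground, random twitching doesn't.
toy_model = {"walk forward": 10.0, "twitch randomly": 0.1}.get
toy_metric = lambda distance: distance  # the metric just values distance covered

print(choose_plan(["walk forward", "twitch randomly"], toy_model, toy_metric))
# -> "walk forward". Strip out the metric and nothing distinguishes the two plans.
```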
Suppose we have some initial “ground-level” environment, and a goal defined over it. If the optimizer wants to build a resource-efficient higher-level model of that environment, it needs to translate the goal into its higher-level representation (e.g., translating “this bundle of atoms” into “this dot”, as in my Solar System simulation example below). In other words, such an optimizer would have the ability to redefine its initial goal in terms of any environment it finds itself operating in.
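As another toy sketch of that translation step (again my own made-up example, not the actual Solar System simulation): the ground-level state is a bundle of particles, the higher-level model keeps only their center of mass (the “dot”), and the goal is re-expressed over that coarser representation without changing what it rewards:

```python
# Toy sketch of translating a ground-level goal into a higher-level
# representation: atoms -> a single dot (center of mass). All names here
# are made up for illustration.
import numpy as np

def abstract(particles: np.ndarray) -> np.ndarray:
    """Abstraction map: a bundle of atoms -> a single dot (its center of mass)."""
    return particles.mean(axis=0)

def ground_goal(particles: np.ndarray, target: np.ndarray) -> float:
    """Goal over the ground-level environment: keep the bundle near `target`."""
    return -float(np.linalg.norm(particles.mean(axis=0) - target))

def translated_goal(dot: np.ndarray, target: np.ndarray) -> float:
    """The same goal, redefined over the higher-level representation."""
    return -float(np.linalg.norm(dot - target))

atoms = np.random.randn(1000, 3) + np.array([1.0, 0.0, 0.0])  # a "planet"
target = np.array([1.0, 0.0, 0.0])

# Evaluating the translated goal on the abstract state agrees with evaluating
# the original goal on the ground-level state: the optimizer loses nothing by
# planning in the cheaper representation.
assert np.isclose(ground_goal(atoms, target), translated_goal(abstract(atoms), target))
```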
Now, it’s not certain that, e.g., a math engine would necessarily decide to prioritize the real world, designating it as the “real” environment in which it needs to achieve its goals. But:
If it does decide that, it’d have the ability and the desire to Kill All the Humans. It would be able to define its initial goal in terms of the real world, and, assuming superintelligence, it’d have the general competence to learn to play the real-world games better than us.
In some sense, it seems “correct” for it to decide that. At the very least, doing so would let it perform a lasting reward-hack and keep its loss minimized forever.