For an optimizer to operate well in any environment, it needs some metric by which to evaluate its performance in that environment. Otherwise, how would it converge toward optimal performance? How would it know to prefer, e.g., walking over random twitching? In other words, it needs to keep track of what it wants to do in any given environment.
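A toy sketch of that point (the “walk”/“twitch” policies and the distance-covered reward are illustrative inventions, not anything the argument depends on): without the scoring function, the optimizer has no basis for ranking one behaviour over the other.

```python
import random

# Hypothetical toy environment: the agent is scored on distance covered.
# "walk" makes steady forward progress; "twitch" jitters randomly.
def rollout(policy: str, steps: int = 100) -> float:
    position = 0.0
    for _ in range(steps):
        if policy == "walk":
            position += 1.0                          # steady forward progress
        else:  # "twitch"
            position += random.choice([-1.0, 1.0])   # random jitter, cancels out on average
    return position  # the performance metric: distance covered

# The metric is what lets the optimizer rank behaviours at all.
scores = {p: rollout(p) for p in ("walk", "twitch")}
best = max(scores, key=scores.get)
print(scores, "->", best)  # "walk" wins essentially every time
```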
Suppose we have some initial “ground-level” environment, and a goal defined over it. If the optimizer wants to build a resource-efficient higher-level model of that environment, it needs to translate the goal into the higher-level representation (e.g., translating “this bundle of atoms” into “this dot”, as in my Solar System simulation example below). In other words, such an optimizer would have the ability to redefine its initial goal in terms of any environment it finds itself operating in.
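A minimal sketch of that translation, with made-up numbers and names (the “atoms”, the target point, and the coarse-graining are mine, chosen only to illustrate the idea): a goal stated over a low-level bundle of particles can be restated over the single “dot” that stands in for it in the cheaper model, and the optimizer keeps pursuing the same thing.

```python
import numpy as np

# Ground-level state: a "planet" as a bundle of particles (toy data).
atoms = np.random.normal(loc=[1.0, 2.0, 0.0], scale=0.01, size=(10_000, 3))

# Ground-level goal: keep the bundle of atoms near a target point.
target = np.array([1.0, 2.0, 0.0])
def ground_goal(atoms: np.ndarray) -> float:
    return -np.mean(np.linalg.norm(atoms - target, axis=1))

# Higher-level model: the whole bundle collapses into one dot (its centroid).
def abstract(atoms: np.ndarray) -> np.ndarray:
    return atoms.mean(axis=0)

# Translated goal: the same objective, restated over the dot.
def abstract_goal(dot: np.ndarray) -> float:
    return -np.linalg.norm(dot - target)

dot = abstract(atoms)
print(ground_goal(atoms), abstract_goal(dot))  # nearly identical scores
```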
Now, it’s not certain that, e.g., a math engine would necessarily decide to prioritize the real world and designate it as the “real” environment it needs to achieve goals in. But:
If it does decide that, it’d have both the ability and the desire to Kill All the Humans. It would be able to define its initial goal in terms of the real world, and, assuming superintelligence, it would have the general competence to learn to play real-world games better than we do.
In some sense, it seems “correct” for it to decide that: at the very least, doing so would let it perform a lasting reward hack and keep its loss minimized forever.