[ASoT] Instrumental convergence is useful

A central concept I got out of Reframing Impact is that instrumental convergence can be useful for shaping the motivations of superintelligent agents. That is, there are two frames for thinking about instrumental convergence, one negative and one positive.
The negative frame: instrumental convergence means most agents will want to take power; this is a safety problem.
The positive frame: instrumental convergence means we can often predict the motivations of agents with arbitrary utility functions.
The discussion seems to center on the negative frame, but the positive frame is useful too! Ideally, it would be instrumentally convergent (in some sense) for the AI to do the thing we want; then we’d have a nice basin of safety.
A toy example of this frame generating interesting ideas is the following exercise:

Using a hypercomputer, create an AGI which takes in some data, builds a world model, and can then communicate with a copy of itself (trained on potentially different data from the same environment) to coordinate on a choice of one object in the environment that is not directly visible to both of their sensors.

I claim this can be completely solved now, with runnable code, by:

Creating a videogame in which the two copies of the AGI (e.g. AIXI) communicate and then individually pick an object (via an object-specification language). If they pick the same object they “win” and are released into separate simulated universes to do whatever they like; if they pick different objects they “die” (i.e. lose power and cannot take further actions).
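To make the shape of the setup concrete, here is a minimal sketch of the game loop. It is not the pseudocode mentioned at the end of the post: `Agent` is just a stand-in interface for an AIXI-like policy, and names like `CoordinationGame` and the string-based object identifiers are illustrative assumptions.

```python
# Minimal sketch of the coordination game described above. `Agent` is a
# stand-in for an AIXI-like policy: it observes, exchanges messages with its
# counterpart, and finally names one object in a shared object-specification
# language (modelled here as plain string identifiers).
from dataclasses import dataclass
from typing import Protocol


class Agent(Protocol):
    def observe(self, observation: dict) -> None: ...
    def send_message(self) -> str: ...
    def receive_message(self, message: str) -> None: ...
    def pick_object(self) -> str: ...  # an expression in the object-spec language


@dataclass
class CoordinationGame:
    env_objects: set[str]            # all objects in the shared environment
    view_a: dict                     # partial observation given to agent A
    view_b: dict                     # partial observation given to agent B
    communication_rounds: int = 10

    def play(self, agent_a: Agent, agent_b: Agent) -> bool:
        """Return True iff the two agents coordinate on the same object."""
        agent_a.observe(self.view_a)
        agent_b.observe(self.view_b)

        # Let the copies talk back and forth for a fixed number of rounds.
        for _ in range(self.communication_rounds):
            msg_a, msg_b = agent_a.send_message(), agent_b.send_message()
            agent_a.receive_message(msg_b)
            agent_b.receive_message(msg_a)

        choice_a, choice_b = agent_a.pick_object(), agent_b.pick_object()
        won = choice_a == choice_b and choice_a in self.env_objects
        # On a win, each agent would be handed its own simulated universe to
        # act in; on a loss, both processes are simply terminated.
        return won
```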
Even though we can’t solve inner alignment for the agents, they’ll still want to coordinate in order to preserve optionality / seek power. This works as long as we don’t look into the simulation (otherwise the agents will hack us to escape and gain more computational resources from base reality) and we prove the code correct. Hardware exploits could still break this, but setting those aside, it works.
(If you don’t agree this can be turned into code, ask me about specific parts. I have ugly pseudocode equivalent to the description above, which I’m confident can be turned into runnable code (on a hypercomputer) by adding AIXI, seeds (like Game of Life patterns) for the universes the AIs are released into, etc.)
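One illustrative reading of the “seeds” above (a sketch rather than the pseudocode just mentioned; `life_step`, `release_universe`, and the one-cell-per-tick action channel are all assumptions) is a Game-of-Life-style world whose initial grid is the seed and in which the released agent can act:

```python
# Illustrative only: a Game-of-Life world as the universe a winning agent is
# released into. The seed is the initial 0/1 grid; the agent's action channel
# (toggling one cell per tick) is an assumption made for the sketch.
import numpy as np


def life_step(grid: np.ndarray) -> np.ndarray:
    """One Game of Life update on an integer 0/1 grid with wraparound edges."""
    neighbours = sum(
        np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(grid.dtype)


def release_universe(agent, seed_grid: np.ndarray, steps: int) -> np.ndarray:
    """Run the seeded world forward, letting the released agent toggle one cell per tick."""
    grid = seed_grid.copy()
    rows, cols = grid.shape
    for _ in range(steps):
        x, y = agent.act(grid)          # the agent "does whatever" through this channel
        grid[x % rows, y % cols] ^= 1   # flip the chosen cell
        grid = life_step(grid)
    return grid
```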