For reference and ease of quoting, this comment is a text-only version of the post above. (It starts at “Text:” below.) I am not the OP.
Formatting:
It’s not clear how to duplicate the color effect* or cross words out**, so that hasn’t been done. Instead, crossed-out words are followed by “? (No.)”, and here’s a list of some words by color to refresh the color/concept relation:
Blue words:
Power/impact/penalty/importance/respect/conservative/catastrophic/distance measure/impact measurement
Purple words:
incentives/actions/(reward)/expected utility/complicated human value/tasks
Text:
Last time on reframing impact:
(CCC)
Catastrophic Convergence Conjecture:
Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives
If the CCC is right, then if power gain is disincentivised, the agent isn’t incentivised to overfit and disrupt our AU landscape.
Without even knowing who we are or what we want, the agent’s actions preserve our attainable utilities.
We can tell it:
Make paperclips
or
Put that strawberry on the plate
or
Paint the car pink
...
but don’t gain power.
This approach is called Attainable Utility Preservation
We’re focusing on concepts in this post. For now, imagine an agent receiving a reward for a primary task minus a scaled penalty for how much its actions change its power (in the intuitive sense). This is AUP_conceptual, not any formalization you may be familiar with.
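To make that concrete, here is a minimal sketch (in Python) of the “task reward minus a scaled power-change penalty” idea. It is purely illustrative: the power numbers and the scale are made up, and this is not a formalization from the sequence.

```python
# Illustrative only: "power" here is a made-up scalar standing in for the
# intuitive notion of the agent's power, not a quantity defined in the post.
def aup_conceptual_reward(task_reward, power_before, power_after, scale=1.0):
    """Primary task reward minus a scaled penalty for how much the action changes the agent's power."""
    power_change = abs(power_after - power_before)
    return task_reward - scale * power_change

# A power-grabbing plan (e.g. building lots of factories) earns more task reward
# but pays a large penalty; a modest plan comes out ahead.
print(aup_conceptual_reward(task_reward=10.0, power_before=1.0, power_after=9.0))  # 2.0
print(aup_conceptual_reward(task_reward=6.0, power_before=1.0, power_after=1.5))   # 5.5
```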
What might a paperclip-manufacturing AUP_conceptual agent do?
Build lots of factories? (No.)
Copy itself? (No.)
Nothing? (No.)
Narrowly improve paperclip production efficiency ← This is the kind of policy AUP_conceptual is designed to encourage and allow. We don’t know if this is the optimal policy, but by CCC, the optimal policy won’t be catastrophic.
AUP_conceptual dissolves thorny issues in impact measurement.
Is the agent’s ontology reasonable?
Who cares.
Instead of regulating its complex physical effects on the outside world,
the agent is looking inwards at itself and its own abilities.
How do we ensure the impact penalty isn’t dominated by distant state changes?
Imagine I take over a bunch of forever inaccessible stars and jumble them up. This is a huge change in state, but it doesn’t matter to us.
AUP_conceptual solves this “locality” problem by regularizing the agent’s impact on the nearby AU landscape.
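As a toy illustration of why an AU-based penalty has this locality built in while a raw state-change penalty does not, compare the two on the jumbled-stars example (all numbers invented):

```python
# Toy comparison, with invented numbers. "Attainable utilities" are hypothetical
# scores for what nearby agents can still achieve; nothing here comes from a
# formal model in the post.
def state_change_penalty(state_before, state_after):
    # Penalizes raw change anywhere in the world, even in forever inaccessible places.
    return sum(abs(b - a) for b, a in zip(state_before, state_after))

def au_change_penalty(aus_before, aus_after):
    # Penalizes only shifts in what affected agents can still attain nearby.
    return sum(abs(b - a) for b, a in zip(aus_before, aus_after))

# Jumbling forever inaccessible stars: a huge change in raw state...
state_before = [0.0, 0.0, 0.0]
state_after  = [9.0, -7.0, 4.0]

# ...but nobody's local attainable utilities move at all.
aus_before = [0.8, 0.5, 0.3]   # e.g. "make paperclips", "put the strawberry on the plate", ...
aus_after  = [0.8, 0.5, 0.3]

print(state_change_penalty(state_before, state_after))  # 20.0 -- dominated by distant changes
print(au_change_penalty(aus_before, aus_after))          # 0.0 -- the distant jumbling doesn't register
```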
What about butterfly effects?
How can the agent possibly determine which effects it’s responsible for?
Forget about it.
AUP_conceptual agents are respectful and conservative with respect to the local AU landscape without needing to assume anything about its structure or the agents in it.
How can an idea go wrong?
There can be a gap between what we want and the concept, and then a gap between the concept and the execution.
For past impact measures, it’s not clear that their conceptual thrusts are well-aimed, even if we could formalize everything correctly. Past approaches focus either on minimizing physical change to some aspect of the world or on maintaining the ability to reach many world states.
The hope is that in order for the agent to cause a large impact on us it has to snap a tripwire.
The problem is… well, it’s not clear how we could possibly know whether the agent can still find a catastrophic policy; in a sense, the agent is still trying to sneak by the restrictions and gain power over us. An agent maximizing expected utility while only minimally changing the world still probably leads to catastrophe.
That doesn’t seem to be the case for AUP_conceptual.
Assuming the CCC, an agent which doesn’t gain much power doesn’t cause catastrophes. This has no dependency on complicated human value, and most realistic tasks should have reasonable, high-reward policies that don’t gain undue power.
So AUP_conceptual meets our desiderata:
The distance measure should:
1) Be easy to specify.
2) Put catastrophes far away.
3) Put reasonable plans nearby.
Therefore, I consider AUP to conceptually be a solution to impact measurement.
Wait! Let’s not get ahead of ourselves! I don’t think we’ve fully bridged the concept/execution gap.
However, for AUP, it seems possible—more on that later.
Thanks for doing this. I was originally going to keep a text version of the whole sequence, but I ended up making lots of final edits in the images, and this sequence has already taken an incredible amount of time on my part.