Technical alignment game tree draft
I’ve found it useful to organize my thoughts on technical alignment strategies approximately by AGI architecture (and related assumptions about AGI design). The target audience of this text is mostly myself; once I have it better worded, I might turn it into a post. I’m mostly ignoring corrigibility-style approaches here.
End-to-end trained RL: Optimizing too hard on any plausible training objective leads to gaming that objective.
Scalable oversight: Adjust the objective to make it harder to game as the agent gets more capable of gaming it (toy sketch of this race below).
Problem: Transparency is required to detect gaming/deception, and that gets harder as capabilities scale. You also need to be careful to scale capabilities slowly, so that oversight can keep pace.
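A minimal toy sketch of the race this implies. Everything here is a made-up illustration, not a model of any real training setup: capability and oversight strength are just scalars, and the "undetected gaming" probability is an assumed function of the gap between them.

```python
# Toy simulation of the scalable-oversight dynamic: the overseer must improve at
# roughly the rate the agent improves, or gaming eventually slips through.
# All numbers and update rules are illustrative assumptions.
import random

agent_capability = 1.0     # how good the agent is at finding exploits
overseer_strength = 1.0    # how good the overseer is at detecting them
AGENT_GROWTH = 0.10        # per-round improvement of the agent
OVERSIGHT_GROWTH = 0.09    # per-round improvement of the overseer (slightly behind)

random.seed(0)
for step in range(50):
    agent_capability *= 1 + AGENT_GROWTH
    overseer_strength *= 1 + OVERSIGHT_GROWTH

    # Assumed: the chance an exploit goes undetected grows with the
    # capability gap between agent and overseer.
    gap = agent_capability - overseer_strength
    p_undetected = max(0.0, min(1.0, gap / agent_capability))

    if random.random() < p_undetected:
        print(f"step {step}: gaming went undetected "
              f"(agent={agent_capability:.2f}, overseer={overseer_strength:.2f})")
        break
else:
    print("overseer kept pace for the whole run")
```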
Shard theory: Agents that can influence their own data distribution will hit a phase transition where they start steering that distribution to avoid goal changes (toy illustration below). Plausibly we could make this phase transition happen before deception/gaming shows up, and train the agent to value honesty plus other standard human values.
Problem: This needs to be robust to “improving cognition for efficiency & generality”, i.e. the goal-directed part of the mind overriding the heuristic-morality part of the mind.
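A toy illustration of the "steer your own data distribution to avoid goal change" behavior. This is purely my own construction, not a claim about how shards actually form: an agent with a small value head scores candidate training batches by how much a gradient step on them would move its current values, and prefers the batch that moves them least.

```python
# Toy illustration (my own construction): an agent that can choose which data it
# trains on prefers the batch whose gradient step would change its current value
# head the least, i.e. it protects its existing "goals".
import torch
import torch.nn as nn

torch.manual_seed(0)

value_head = nn.Linear(4, 1)   # stand-in for the agent's learned values
loss_fn = nn.MSELoss()

def goal_change_if_trained_on(batch_x, batch_y):
    """Norm of the gradient this batch would apply to the value head."""
    loss = loss_fn(value_head(batch_x), batch_y)
    grads = torch.autograd.grad(loss, value_head.parameters())
    return sum(g.norm() for g in grads).item()

# Two candidate data sources the agent could seek out.
batch_a = (torch.randn(32, 4), torch.randn(32, 1))         # conflicts with current values
batch_b = (torch.randn(32, 4) * 0.1, torch.zeros(32, 1))   # nearly value-neutral data

scores = {"a": goal_change_if_trained_on(*batch_a),
          "b": goal_change_if_trained_on(*batch_b)}
chosen = min(scores, key=scores.get)
print(f"goal-change scores: {scores}; agent steers toward batch {chosen}")
```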
Transfer learning
Natural abstractions: Most of the work of identifying human values comes from predictive world modeling. That knowledge transfers, so training the model to pursue human goals is relatively data-efficient (minimal sketch below).
Failure mode: Reaching high capabilities can’t require much further optimization; otherwise the capabilities optimization will probably dominate the goal learning and re-learn most of the goals.
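A minimal sketch of the data-efficiency claim, under the (big) assumption that a frozen pretrained world model already represents the relevant human concepts: only a small goal head gets trained, on a handful of labeled examples, on top of frozen features. The "pretrained" model below is a random stand-in; in the real claim it would come from large-scale predictive training.

```python
# Sketch of the transfer-learning claim: if a frozen pretrained world model already
# encodes the relevant concepts, a tiny goal head trained on few labels may suffice.
import torch
import torch.nn as nn

torch.manual_seed(0)

world_model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 32))
for p in world_model.parameters():
    p.requires_grad_(False)        # frozen: no further capabilities optimization here

goal_head = nn.Linear(32, 1)       # the only part trained on human feedback

# A few dozen "human-labeled" examples stand in for cheap goal data.
x = torch.randn(48, 64)
y = torch.rand(48, 1)              # e.g. human approval scores in [0, 1]

opt = torch.optim.Adam(goal_head.parameters(), lr=1e-2)
for epoch in range(200):
    pred = torch.sigmoid(goal_head(world_model(x)))
    loss = nn.functional.binary_cross_entropy(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final goal-head loss on the small labeled set: {loss.item():.3f}")
```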
We have an inner-aligned scalable optimizer, which will optimize a given precisely defined objective (given a world model & action space).
Vanessa!IRL gives us an approach to detecting agents modeled inside an incomprehensible ontology and finding their goals.
Unclear whether the way Vanessa!IRL draws the line between irrationalities and goal-quirks is the line a human would want drawn on reflection (this underdetermination shows up even in a much simpler toy setting, sketched below).
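The worry is visible even in the much simpler Boltzmann-rationality framing (this is not Vanessa's formalism, just a toy stand-in): observed behavior alone does not pin down where the irrationality/goals line goes, so some extra principle has to draw it.

```python
# Toy illustration (Boltzmann rationality, NOT Vanessa's infra-Bayesian setup):
# the same observed behavior is explained equally well by "nearly rational with a
# mild preference" and "very noisy with a strong preference".
import math

p_choose_arm1 = 0.7                  # observed: human picks arm 1 70% of the time
# Under a Boltzmann policy, log(p1/p2) = beta * (R1 - R2).
log_odds = math.log(p_choose_arm1 / (1 - p_choose_arm1))

for beta in (0.5, 2.0, 10.0):        # candidate rationality levels
    reward_gap = log_odds / beta     # implied strength of preference
    print(f"beta={beta:4.1f}  ->  implied R1 - R2 = {reward_gap:.3f}")

# Every (beta, reward_gap) pair with beta * reward_gap = log_odds fits the data
# exactly, so the decomposition into "noise" vs "values" is not identified by
# behavior alone.
```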
We have human-level imitators.
Do some version of HCH (Humans Consulting HCH; recursive sketch below).
But imitators probably won’t stay well aligned off distribution, for ordinary goal-misgeneralization reasons, and HCH moves them off distribution, at least a bit.
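A sketch of the recursive structure. The `imitator` below is a hypothetical stand-in for a human-level imitator, the explicit `decompose` step and the fixed depth limit are my own simplifications, and the stub just echoes questions rather than doing real work.

```python
# Sketch of HCH (Humans Consulting HCH): a human-level imitator answers a question,
# and may delegate subquestions to further copies of the same scheme.
from typing import List

def imitator(question: str, sub_answers: List[str]) -> str:
    """Hypothetical human imitator: a model trained to imitate a human who sees
    the question plus the answers to the subquestions they asked."""
    return f"answer({question}; using {len(sub_answers)} sub-answers)"

def decompose(question: str) -> List[str]:
    """Hypothetical: the subquestions the imitated human would ask."""
    return [f"sub-part {i} of: {question}" for i in range(2)]

def hch(question: str, depth: int) -> str:
    if depth == 0:
        return imitator(question, [])               # leaf: answer unaided
    subs = [hch(q, depth - 1) for q in decompose(question)]
    return imitator(question, subs)                 # answer with consultants' help

print(hch("Is this plan safe?", depth=2))
```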
End-to-end RL + we have unrealistically good interpretability.
Re-target the search (mechanical sketch below).
Interpretability probably won’t ever be this good, and if it were, we might not need to learn the search algorithm at all (we could build it from scratch, probably with better inner-alignment guarantees).
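A sketch of what retargeting would mechanically look like, under the unrealistically strong assumption that interpretability has already located a goal-representation slice of some layer's activations. The model, the slice, and the target vector below are all made-up stand-ins; the hook simply overwrites that slice with a goal representation of our choosing.

```python
# Sketch of "re-target the search", assuming (unrealistically) that interpretability
# has told us that dimensions GOAL_SLICE of layer net[2]'s output encode the
# search's target. A forward hook swaps in our chosen goal representation.
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                    nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 4))
GOAL_SLICE = slice(0, 8)              # hypothetical goal-encoding dimensions
our_goal = torch.randn(8)             # hypothetical "aligned" goal vector

def retarget_hook(module, inputs, output):
    patched = output.clone()
    patched[..., GOAL_SLICE] = our_goal   # overwrite the goal representation
    return patched                        # returned tensor replaces the layer output

handle = net[2].register_forward_hook(retarget_hook)

x = torch.randn(1, 16)
print("retargeted output:", net(x))
handle.remove()
```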