It seems like this problem has an obvious solution.
Instead of building your process like this:
optimize for good agent → predict what they will say → predict what they will say → …
Build your process like this:
optimize for good agent → predict what they will say → optimize for good agent → predict what they will say → optimize for good agent → predict what they will say → …
If there’s some space of “Luigis” that we can identify (e.g. with RLHF) surrounded by some larger space of “Waluigis”, just apply optimization pressure at every step to make sure we stay in the “Luigi” space instead of letting the process wander out into the “Waluigi” space.
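As a rough sketch of the difference between the two loop shapes, here is some illustrative pseudocode. The helpers `apply_optimization_pressure` (an RLHF-style nudge back toward the “Luigi” space) and `predict_next_reply` (one unconstrained prediction step) are hypothetical stand-ins, not anything from the original post:

```python
def apply_optimization_pressure(agent):
    # Hypothetical stand-in: nudge the agent back toward the desired ("Luigi") region.
    return agent

def predict_next_reply(agent, transcript):
    # Hypothetical stand-in: one unconstrained simulator/prediction step.
    return f"reply {len(transcript) + 1} from {agent}"

def optimize_once_then_predict(agent, n_steps=6):
    """Optimize for a good agent once, then predict repeatedly (the failure-prone shape)."""
    agent = apply_optimization_pressure(agent)  # single push into "Luigi" space
    transcript = []
    for _ in range(n_steps):
        # Each further step is free to drift out into "Waluigi" space.
        transcript.append(predict_next_reply(agent, transcript))
    return transcript

def optimize_at_every_step(agent, n_steps=6):
    """Re-apply optimization pressure before every prediction step (the proposed shape)."""
    transcript = []
    for _ in range(n_steps):
        agent = apply_optimization_pressure(agent)  # pull back into "Luigi" space each turn
        transcript.append(predict_next_reply(agent, transcript))
    return transcript
```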
Note that the Bing “fix” of not allowing more than 6 replies partially implements this by giving a fresh start in the “Luigi” space periodically.