The point of my setup is that [ P(outcome|corrigible action) ] is very small, so by Bayes' rule [ P(incorrigible action|outcome) ] is largeish, even if [ Frequency(corrigible action) ] is high and [ Frequency(incorrigible action) ] is low or absent.
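To make that concrete, here is a minimal numeric sketch (the numbers are purely illustrative, not from anything you or I have claimed): even with a strong prior toward corrigible actions, an outcome that corrigible actions almost never produce makes the incorrigible explanation dominate.

```python
# Hypothetical, illustrative probabilities only.
p_corrigible = 0.99           # Frequency(corrigible action): high
p_incorrigible = 0.01         # Frequency(incorrigible action): low
p_outcome_given_corr = 1e-6   # P(outcome | corrigible action): very small
p_outcome_given_incorr = 0.5  # P(outcome | incorrigible action): assumed non-negligible

# Bayes' rule for P(incorrigible action | outcome)
posterior = (p_incorrigible * p_outcome_given_incorr) / (
    p_incorrigible * p_outcome_given_incorr
    + p_corrigible * p_outcome_given_corr
)
print(posterior)  # ~0.9998, i.e. "largeish" despite the low prior
```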
And this is alignment-relevant, because I expect people will ask for never-before-seen outcomes (by chance or on purpose), some of which may soft-require incorrigible actions.
(And of course there could be optimisation daemons that do treacherous turns even when people ask for normal actions. But I think your post is setting that aside, which seems reasonable.)