For an alignment proposal you can ask where value judgement ultimately bottoms out, and of course in this case at some point it’s a human/humans in the loop. This reminds me of a distinction Rohin Shah has discussed between ML alignment proposals: those which load value information ‘all at once’ (pre-deployment) and those which (are able to) take in value feedback incrementally at runtime.
I think, naively interpreted, RAT looks like it’s trying to load value ‘all at once’. This seems really hard for the poor human(s) having to make value judgements about future incomprehensible worlds, even if they have access to powerful assistance! But perhaps not?
E.g. perhaps one of the more important desiderata for ‘acceptability’ is that it only includes behaviour which is responsive (in the right ways!) to ongoing feedback (of one form or another)?