Suggestion for content 1: relationship to ordinary distribution shift problems
When I mention inner alignment to ML researchers, they often think of it as an ordinary problem of (covariate) distribution shift.
My suggestion is to discuss whether a solution to ordinary distribution shift is also a solution to inner alignment. E.g. an ‘ordinary’ robustness problem for imitation learning could be handled safely with an approach similar to Michael’s: maintain a posterior over hypotheses p(h | x_{1:t}, y_{1:t}), with a sufficiently flexible hypothesis class, and ask for help whenever the model is uncertain about the output y for a new input x.
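To make the suggestion concrete, here is a minimal sketch of that kind of ask-for-help imitation learner, not Michael’s actual algorithm: keep a posterior over a (here finite, for simplicity) hypothesis class, act when the posterior concentrates on a single action, and defer to the demonstrator otherwise. The hypothesis class, the `agreement_threshold`, and the `ask_human` hook are my illustrative assumptions.

```python
import numpy as np

class PosteriorImitator:
    def __init__(self, hypotheses, prior=None, agreement_threshold=0.95):
        # hypotheses: list of callables h(x) -> action; stands in for a
        # "sufficiently flexible" hypothesis class.
        self.hypotheses = hypotheses
        self.log_post = (np.log(prior) if prior is not None
                         else -np.log(len(hypotheses)) * np.ones(len(hypotheses)))
        self.threshold = agreement_threshold

    def update(self, x, y, noise=1e-3):
        # Bayesian update p(h | x_{1:t}, y_{1:t}) ∝ p(h | x_{1:t-1}, y_{1:t-1}) p(y | h, x),
        # with a small likelihood floor so no hypothesis is ruled out by one mistake.
        for i, h in enumerate(self.hypotheses):
            lik = 1.0 - noise if h(x) == y else noise
            self.log_post[i] += np.log(lik)
        self.log_post -= np.max(self.log_post)  # numerical stability

    def act(self, x, ask_human):
        # Act only if enough posterior mass agrees on one action; otherwise ask for help.
        post = np.exp(self.log_post)
        post /= post.sum()
        votes = {}
        for p, h in zip(post, self.hypotheses):
            a = h(x)
            votes[a] = votes.get(a, 0.0) + p
        best_action, mass = max(votes.items(), key=lambda kv: kv[1])
        if mass >= self.threshold:
            return best_action
        y = ask_human(x)   # defer to the demonstrator on uncertain (e.g. off-distribution) inputs
        self.update(x, y)  # and learn from the label
        return y
```

The point of the sketch is that "uncertain about the output y" is operationalised as posterior disagreement among surviving hypotheses, which is exactly the quantity an ordinary distribution-shift story would also care about.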
One interesting subtopic is whether inner alignment is an extra-ordinary robustness problem because it is adversarial: even the tiniest difference between train and test inputs might cause the model to misbehave. (See also this.)