Yeah, all of that seems right to me (and I feel like I have a better understanding of why assumptions on inputs are better than assumptions on outputs, which was more like a vague intuition before). I’ve changed the opinion to:
I like the low-stakes assumption as a way of saying “let’s ignore distributional shift for now”. Probably the most salient alternative is something along the lines of “assume that the AI system is trying to optimize the true reward function”. The main way the low-stakes assumption is cleaner is that it is an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). This seems a lot nicer because it is harder to “unfairly” exploit a not-too-strong assumption on an input than one on an output. See [this comment thread](https://www.alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment?commentId=askebCP36Ce96ZiJa) for more discussion.