So my feeling is that, in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning.
Optimizing for a ‘slightly off’ utility function might be catastrophic, and therefore the margin for error in value learning could be narrow. However, it seems plausible that if your impact measure used slightly incorrect utility functions to define the auxiliary set, this would not cause a similar error. Thus, it seems intuitive to me that making impact measures work would require less progress on value learning than a full solution would.
From the AUP paper: “one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.”
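Concretely (and speaking roughly, since the exact scaling term varies across versions of the paper), the penalty AUP adds to the primary reward depends only on how an action changes the agent's attainable value for each auxiliary reward, relative to doing nothing. Here is a minimal sketch of that structure; the names (`q_aux`, `noop`, `lam`) are illustrative, not the paper's code:

```python
# Minimal sketch of an AUP-style penalty: penalize actions that change the
# agent's ability to optimize each auxiliary reward, relative to a no-op.
# `q_aux` is a list of (assumed already-learned) Q-functions, one per
# auxiliary reward; `r` is the primary reward function; `noop` is the
# do-nothing action; `lam` trades off task reward against the penalty.

def aup_reward(r, s, a, q_aux, noop, lam=0.1):
    """Primary reward minus a penalty for shifting attainable utility."""
    penalty = sum(abs(q(s, a) - q(s, noop)) for q in q_aux) / len(q_aux)
    return r(s, a) - lam * penalty
```

Because the auxiliary rewards only enter through this penalty term, getting them slightly wrong perturbs how cautious the agent is rather than what it is optimizing for, which is one way to read the robustness result quoted above.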
I appreciate this clarification, but when I wrote my comment, I hadn’t read the original AUP post or the paper, since I assumed this sequence was supposed to explain AUP starting from scratch (so I didn’t have the idea of an auxiliary set when I wrote my comment).
It is meant to explain AUP starting from scratch, so no worries! To clarify, although I agree with Matthew’s comment, I’ll later explain how value learning (or progress therein) is unnecessary for the approach I think is most promising.