(On reflection this comment is less kind than I’d like it to be, but I’m leaving it as-is because I think it is useful to record my knee-jerk reaction. It’s still a good post; I apologize in advance for not being very nice.)
In theory, such an agent is safe because a human would only approve safe actions.
… wat.
Lol no.
Look, I understand that outer alignment is orthogonal to the problem this post is about, but like… say that. Don’t just say that a very-obviously-unsafe thing is safe. (Unless this is in fact nonobvious, in which case I will retract this comment and give a proper explanation.)
Yeah, you’re right that it’s obviously unsafe. The words “in theory” were meant to gesture at that, but it could be much better worded. Changed to “A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe).”
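For concreteness, here is a minimal sketch (my own, not from the post) of what I take "time-limited myopic approval-maximizing" to mean operationally; `predicted_approval` and `candidate_actions` are hypothetical stand-ins for whatever overseer model and action proposer the agent actually uses:

```python
def myopic_approval_step(state, candidate_actions, predicted_approval):
    """Pick the single action with the highest predicted human approval,
    ignoring all consequences beyond this step (the 'myopic' part)."""
    return max(candidate_actions(state),
               key=lambda action: predicted_approval(state, action))

def run_time_limited(state, candidate_actions, predicted_approval, transition, horizon):
    """Run the agent for a fixed number of steps (the 'time-limited' part)."""
    for _ in range(horizon):
        action = myopic_approval_step(state, candidate_actions, predicted_approval)
        state = transition(state, action)
    return state
```

The hoped-for safety property is just that each chosen action is one the (modeled) human approves of in isolation; as noted above, that is not enough to call the system safe.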
You beat me to making this comment :P Except apparently I came here to make this comment about the changed version.
“A human would only approve safe actions” is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.
The example has been changed to imitation, as suggested by Evan.
Yeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.