Thanks for your comment! I didn’t get around to answering earlier, but maybe it’s still useful to try to clarify a few things.
> If I understand the setup correctly, there’s no guarantee that the optimal element would be good, right?
My threat model here is that we have access to an Oracle that’s not trustworthy (as specified in the first paragraph), so even if we were able to specify our preferences correctly, we would still have a problem. So in this context you can assume that we managed to specify our preferences correctly. If our problem were simply that we misspecified our preferences (this would roughly correspond to “fail at outer alignment”, vs my threat model of “fail at inner alignment”), solving it by soft-maximization would be much easier: just “flatten” the top part of your utility function (i.e. give all outputs that satisfice the same utility), add some noise, and then hand it to an optimizer.
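For concreteness, here is a minimal sketch of that “flatten and add noise” construction (the threshold, noise scale, and function names are made up for illustration; it assumes the optimizer simply maximizes whatever scalar function we hand it):

```python
import random

# Sketch only: THRESHOLD, NOISE_SCALE, `utility` and `candidate_outputs`
# are hypothetical placeholders, not anything from the post.
THRESHOLD = 0.9      # utility above which an output counts as satisficing
NOISE_SCALE = 0.05   # small noise so no satisficing output is systematically preferred

def soft_utility(output, utility):
    """Flattened utility: every satisficing output gets the same value, plus noise."""
    u = min(utility(output), THRESHOLD)                  # flatten the top part
    return u + random.uniform(-NOISE_SCALE, NOISE_SCALE)

def soft_optimize(candidate_outputs, utility):
    # Handing the flattened, noisy utility to an ordinary argmax-optimizer makes it
    # pick an essentially arbitrary element of the satisficing set.
    return max(candidate_outputs, key=lambda o: soft_utility(o, utility))
```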
So I guess the point I tried to make with my post could be stated as “Soft-optimization can also be used to help with inner alignment, not just outer alignment” (and I freely admit that I should have said so in the post).
> It’s just likely since the optimal element a priori shouldn’t be unusually bad, and you’re assuming most satisficing elements are fine.
I’m not just assuming that; I gave a bit of an argument for why I believe it to be the case: I assume that an output is not randomly dangerous, but only dangerous if it was specifically chosen to achieve a goal that differs from ours. This only holds if the goal is not too ambitious; e.g. if we ask the AI controlling the paperclip factory for 10^30 paperclips, that will not go well.
As I understand it, gwern has a stronger opinion on that and believes that side-effects from less ambitious plans are still an important concern. But at least my proposal gets rid of the potentially adversarial selection by the untrusted optimizer, so I believe that getting safe outputs (e.g. by combining our goal with impact regularization) should be much easier.
> I’m still unsure about how useful this problem setup is. For example, we’d probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle).
My assumption about the black-box untrusted optimizer is a worst-case assumption, so I’m trying to understand whether we could get something useful out of the system even in that case. If we can make stronger assumptions about what we know about the system, the problem of course becomes much easier. Given the general state of progress in interpretability compared with progress in capabilities, I think it’s worth thinking about this worst-case scenario a bit.
I like your suggestion to “train the weakest system that can give us satisficing outputs”, but I think doing that might actually be quite difficult: How do we measure (or otherwise predict) how weak/strong our system is and how easy/difficult the satisficing task is? Can these things even be measured on a 1-d scale? Slowly increasing the capability until the system solves the task runs into the problem mentioned in footnote 10 if the system deceptively pretends to be weaker than it is.
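To make the difficulty concrete, here is a crude sketch of that “slowly increase capability” loop (the capability scale, train_system and satisfices are hypothetical placeholders; in particular this assumes capability really can be ordered on a single axis):

```python
def weakest_satisficing_system(capability_levels, train_system, satisfices):
    # Sketch only: train iteratively stronger systems until one satisfices.
    for level in sorted(capability_levels):    # weakest first
        system = train_system(level)
        output = system.produce_output()       # hypothetical interface
        if satisfices(output):
            return system                      # first system that (apparently) suffices
    return None

# The footnote-10 problem: a deceptive system can deliberately return
# non-satisficing outputs, so the loop keeps escalating capability and we end
# up deploying a far stronger system than the task actually required.
```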
“Almost certainly”? That’s a bit too confident for my taste.