It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI.
And the “relatively simple” solution is to do the reverse: put more compute towards making sure the goal specification is what we want than towards maximizing it.
(It’s possible this point isn’t very related to what Wei Dai said.)
Isn’t this just saying it would be nice if we collectively put more resources towards alignment research relative to capabilities research? I still feel like I’m missing something :/
We may be able to offload some of that work to the system itself, e.g. by having it search for a diverse range of models of the user’s intent, instead of making it use a single hardcoded goal specification.
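To make the “diverse range of models” idea a bit more concrete, here is a minimal toy sketch (everything in it is made up for illustration: the candidate intent models, the action names, and the disagreement threshold). The system scores actions under several candidate models of the user’s intent, prefers actions that all the candidates endorse, and asks the user for clarification when the candidates disagree too much about the chosen action:

```python
# Toy sketch (purely illustrative, all names and numbers are hypothetical):
# instead of maximizing one hardcoded goal specification, the system keeps
# several candidate models of the user's intent, prefers actions that score
# well under every candidate, and defers to the user when they disagree.

def intent_model_a(action):
    # One reading of "clean up my desk": getting rid of clutter is great.
    return {"tidy_desk": 1.0, "shred_papers": 0.9, "do_nothing": 0.0}[action]

def intent_model_b(action):
    # Another reading: the papers might be important, destroying them is bad.
    return {"tidy_desk": 0.8, "shred_papers": -1.0, "do_nothing": 0.0}[action]

CANDIDATE_MODELS = [intent_model_a, intent_model_b]
ACTIONS = ["tidy_desk", "shred_papers", "do_nothing"]
DISAGREEMENT_THRESHOLD = 0.5  # arbitrary cutoff for this toy example

def choose_action(actions, models, threshold):
    """Pick the action with the best worst-case score across intent models;
    if the models still disagree a lot about that action, ask the user."""
    def worst_case(action):
        return min(model(action) for model in models)

    best = max(actions, key=worst_case)
    spread = max(model(best) for model in models) - worst_case(best)
    return "ask_user_for_clarification" if spread > threshold else best

print(choose_action(ACTIONS, CANDIDATE_MODELS, DISAGREEMENT_THRESHOLD))
# -> "tidy_desk": it scores well under both readings of the user's intent,
#    whereas "shred_papers" looks good under only one of them.
```

This is only meant to gesture at the shape of the idea (act robustly across many models of intent, and treat disagreement as a signal to ask), not as a real alignment proposal.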
This comment of mine is a bit related if you want more elaboration:
https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#jJ7nng3AGmtAWfxsy
If you have thoughts on it, probably best to reply there—we are already necroposting, so let’s keep the discussion organized.