Isn’t this just saying it would be nice if we collectively put more resources towards alignment research relative to capabilities research? I still feel like I’m missing something :/
We may be able to offload some work to the system itself, e.g. by having it search over a diverse range of models of the user’s intent, instead of making it optimize a single hardcoded goal specification.
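To make that a bit more concrete, here is a rough Python sketch (purely illustrative, not from anyone's actual proposal; the names like `candidate_reward_models` and `choose_action` are made up): instead of optimizing one fixed reward, the agent scores actions against several candidate models of the user's intent and acts conservatively across them.

```python
from typing import Callable, List, Sequence

Action = str
RewardModel = Callable[[Action], float]

def choose_action(actions: Sequence[Action],
                  candidate_reward_models: List[RewardModel]) -> Action:
    """Pick the action whose worst-case score across the candidate
    intent models is highest, rather than optimizing one fixed goal."""
    def worst_case_score(a: Action) -> float:
        return min(model(a) for model in candidate_reward_models)
    return max(actions, key=worst_case_score)

# Toy usage: two hypotheses about what the user actually wants.
models = [
    lambda a: 1.0 if a == "fetch coffee" else 0.0,                  # hypothesis 1
    lambda a: 0.8 if a in ("fetch coffee", "fetch tea") else 0.0,   # hypothesis 2
]
print(choose_action(["fetch coffee", "fetch tea", "do nothing"], models))
```

The aggregation rule (worst-case here) is just one choice; the point is only that the system, not the designer, carries some of the burden of pinning down the goal.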
This comment of mine is a bit related if you want more elaboration:
https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#jJ7nng3AGmtAWfxsy
If you have thoughts on it, probably best to reply there—we are already necroposting, so let’s keep the discussion organized.