we know how to specify rewards for… “A human approved this output”; we don’t know how to specify rewards for “Actually good alignment research”.
Can’t these be the same thing? If we have humans who can identify actually good alignment research, we can sit them down in the RLHF booth and have the AI try to figure out how to make them happy.
Now obviously a sufficiently clever AI will infer the existence of the RLHF booth and start hacking the human in order to escape its box, which would be bad for alignment research. But it’s looking increasingly plausible that e.g. GPT-6 will be smart enough to provide actually good mathematical research without being smart enough to take over the world (that doesn’t happen until GPT-8). So why not alignment research?
To break the comparison, I think you need to posit one of two things: either alignment research is way harder than math research (as Eli understands Eliezer does), such that anything smart enough to do it is also smart enough to hack a human; or we don’t have humans who can identify actually good alignment research.