I think that is a good post, and I strongly agree with most of it. I do think, though, that the role of RLHF, or RL fine-tuning in general, is underemphasized. My fear isn’t that the Predictor by itself will spawn a super agent, not even given a very special prompt.
My fear is that it may learn biases good enough that RL can push it significantly beyond human level. That it may take the strengths of different humans and combine them. That it wouldn’t be like imitating the smartest person, but a cognitive version of “create a super-human by choosing genes carefully from just the existing genetic variation”, which I strongly suspect is possible. Imagine that there are 20 core cognitive problems you learn to solve in childhood, and that you may learn better or worse algorithms for solving them. Imagine Feynman got 12 of them right, and Hitler got his power because he got another 5 of them right. If RL can therefore produce something like a human who got 17 right, that’s a big problem. If it can extrapolate what a human would be like if their working memory were just twice as large, that’s a big problem too.
To be honest, I do not expect RLHF to do that. “Do the thing that makes people press like” doesn’t seem to me like an ambitious enough problem to unlock much (though the buried Predictor may extrapolate from short-term competence to an ambitious mask). But if that is true, someone will eventually be tempted to be more… creative about the utility function. I don’t think you can train it on “maximize Microsoft share value” yet, but I expect it to become possible in a decade or two, and maybe less for some other dangerous utility.