To be honest, I do not expect RLHF to do that. “Do the thing that makes people press like” doesn’t seem to me like an ambitious enough problem to unlock much (the buried Predictor may extrapolate from short-term competence to an ambitious musk, though). But if that is true, someone will eventually be tempted to be more… creative about the utility function. I don’t think you can train it on “maximize Microsoft share value” yet, but I expect it to be possible in a decade or two, and maybe less for some other dangerous utility.