Very clearly written!
I don’t exactly agree with your prescription. I called a recent post “How to solve deception and still fail,” but it could also have been titled “How to do something like RLHF with true oversight on the internal representations and still fail.”
To get the good ending, we don’t just need the ability to supervise; we first need an AI that uses sufficiently good meta-preferences (information about how we want to be modeled as having preferences / what we think good reasoning about our preferences looks like) when interpreting human feedback as fine-tuning reward.
Thanks :)
I skimmed your post, and I think I agree with what you’re saying. However, I think what you’re pointing at is in the same class of problems as deep deceptiveness.
In my framing, I would put it as a problem of the target you’re actually optimizing for still being underspecified enough to allow models that come up with bad plans you like. I fully agree that trying to figure out whether a plan generated by a superintelligence is good is an incredibly difficult problem to solve, and that if we have to rely on that we probably lose. However, I don’t see how this applies to building a preference ordering over the objectives of the AI, as opposed to the plans generated by it. That doesn’t require the same kind of front-loaded inference about whether a plan would lead to good outcomes, because you’re relying on latent information that is both immediately descriptive of the model’s internals and (conditional on a robust enough representation of objectives) not something the model is incentivized to obfuscate from an overseer.
This does still require that you use that preference signal to converge onto a narrow segment of model space where the AI’s objectives are pretty tightly bound with ours, instead of simply deciding whether a given objective is good (which can leave out relevant information, as you say). I don’t think this changes much on its own, though: if you try to do this for evaluations of plans, you lose anyway, because the core problem is that your evaluation signal is too underspecified to select between “plans that look good” and “plans that are good” regardless of how you set it up. But I don’t see how the evaluation signal for objectives is similarly underspecified; if your intervention is actually on a robust representation of the internal goal, then it seems to me like the goal that looks the best actually is the best.
That said, I don’t think that the problem of learning a good preference model for objectives is trivial. I think that it’s a much easier problem, though, and that the bulk of the underlying problem lies in being able to oversee the right internal representations.