This sounds like both an alignment and a capabilities problem.
AI 2027-style takeoffs do not look plausible when you can’t extract reliable work from models.
I’d be worried about leaning too much on this assumption. My guess is that “paper over this enough to get meaningful work” is a strictly easier problem than “robustly solve the actual problem”. For example, imagine you have a model that is blatantly reward hacking a non-negligible amount of the time, but is really useful anyway. It’s hard to argue that people aren’t getting meaningful work out of o3 or Sonnet 3.7, and impossible to argue that those models are aligned here. As capabilities increase, even if the reward hacking gets worse, the models will get more useful, so by default we’ll tolerate more of it. Models have a “misalignment vs. usefulness” tradeoff they can make.
I think it’s hard to get a useful model for reasons related to the blatant reward hacking: the difficulty of doing RL on long-horizon tasks without a well-defined reward signal.