SGD has inductive biases, but we’d have to actually engineer them to get high V rather than high X when only trained on U = V + X (where V is the value we actually care about and X is the error term). In the Gao et al. paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I’m pretty excited about work that improves on that paper by getting higher gold reward while only having access to the proxy reward model.
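As a toy illustration of why selection on U = V + X doesn’t by itself buy V (a sketch of my own, not the Gao et al. setup; it assumes an independent, heavy-tailed error term): best-of-n selection on the proxy keeps pushing U up, but much of that gain comes from X, and the gold value V of the selected candidate lags far behind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy best-of-n selection on a proxy U = V + X.
# V is the "gold" value we actually care about; X is an independent,
# heavy-tailed error term (an assumption made purely for illustration).
# We pick the candidate with the highest proxy U and report how much
# gold value that selection actually buys as n grows.

n_trials = 5000
for n in [1, 4, 16, 64, 256, 1024]:
    V = rng.normal(0.0, 1.0, size=(n_trials, n))    # gold value
    X = rng.standard_t(df=2, size=(n_trials, n))    # heavy-tailed error
    U = V + X                                       # proxy reward
    pick = U.argmax(axis=1)                         # best-of-n on the proxy
    rows = np.arange(n_trials)
    print(f"n={n:4d}  proxy U={U[rows, pick].mean():6.2f}  gold V={V[rows, pick].mean():5.2f}")
```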
I think the point still holds in the mainline shard theory world, which in my understanding uses reward shaping + interp to get an agent composed of shards that value proxies that correlate with high V more often than with high X, so that we are selecting on something other than U = V + X. When the AI ultimately outputs a plan for alignment, why would it inherently value producing an accurate plan rather than inherently value misleading humans? I think we agree that the answer has to be that SGD has inductive biases and that we understand them well enough to do directionally better than conditioning at constructing an AI that does what we want.