‘We conjecture that reinforcement strengthens the behavior-steering computations that guide a system into reinforcement events, and that those behavior-steering computations will only form around abstractions already represented inside of a system at the time of reinforcement. We bet that there are a bunch of quantitative relationships here just waiting to be discovered—that there’s a lot of systematic structure in what learned values form given which training variables. To ever get to these quantitative relationships, we’ll need to muck around with language model fine-tuning under different conditions a lot.’ → this could be (somewhat) relevant: https://openreview.net/forum?id=mNtmhaDkAr