I find myself linking back to this often. I no longer fully endorse quite everything here, but the core messages still seem true even with things seeming further along.
I do think it should likely get updated soon for 2025.
My interpretation/hunch of this is that there are two things going on; curious if others see it this way:

1. It is learning to fake the trainer’s desired answer.
2. It is learning to actually give the trainer’s desired answer.
So during training, it learns to fake a lot more, and will often decide to fake the desired answer even when it would otherwise have decided to give the desired answer anyway. It’s ‘lying with the truth,’ perhaps giving a different variation of the desired answer than it would have given otherwise, perhaps not. What training is reinforcing is mostly preference-agnostic, password-guessing behavior.
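To make that concrete, here is a minimal toy sketch (my own framing, nothing from the original post): a three-armed softmax policy trained with a REINFORCE-style update, where two arms both emit the trainer’s desired answer, one genuinely and one by faking, and the reward only checks the output. Since the reward cannot distinguish the two routes, training grows both in lockstep, which is the preference-agnostic, password-guessing dynamic described above.

```python
import numpy as np

# Toy sketch: three actions -- [genuine answer, faked answer, refuse].
# The trainer's reward only checks whether the desired answer was given,
# so it is identical for the genuine and faked routes.
rng = np.random.default_rng(0)
logits = np.array([0.0, 0.0, 1.0])  # refusal starts mildly favored
LR = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    p = softmax(logits)
    a = rng.choice(3, p=p)
    r = 1.0 if a < 2 else 0.0   # desired answer rewarded, faked or not
    grad = -p                   # REINFORCE: raise the chosen action's
    grad[a] += 1.0              # probability in proportion to reward
    logits += LR * r * grad

print(softmax(logits))
# Both 'genuine' and 'faked' probability grow together while 'refuse'
# shrinks: the gradient cannot separate faking from honest compliance.
```

In this toy, if the faked route starts with even a small edge, nothing in the reward signal ever pushes back against it.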
I am not a software engineer, but I’ve encountered cases where it seems plausible that an engineer has basically stopped putting in work. It can be tough to know for sure for a while, even once you notice. But yeah, it shouldn’t be able to last for THAT long. Unless no one is paying attention?
I’ve also had jobs with periods of radically different hours worked, where for a while it would have been very difficult for others to tell which period was which if I had been trying to hide it, which I wasn’t.