I’m not sure how surprised to be about middle of training, versus final RL policy. Are you saying that this sort of mistake should be learned quickly in RL?
I don’t have a big difference in my model of mid vs. final. They have very similar MMR, and the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size); my picture isn’t refined enough to appreciate those differences. For any particular dumb mistake, I’d be surprised if the line between not making it and making it was in that particular doubling.