It seems that we have independently converged on many of the same ideas. Writing is very hard for me and one of my greatest desires is to be scooped, which you’ve done with impressive coverage here, so thank you.
Thanks for writing the simulators post! That crystallized a lot of things I had been bouncing around.
A decision transformer conditioned on an outcome should still predict a probability distribution, and generate trajectories that are typical for the training distribution given the outcome occurs, which is not necessarily the sequence of actions that is optimally likely to result in the outcome.
That’s a good framing.
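A toy numerical sketch of that distinction (the numbers are made up purely for illustration, not taken from the post): conditioning on the outcome recovers the actions that typically accompany success in the training data, not the action most likely to cause success.

```python
# Toy example: outcome-conditioned prediction vs. outcome maximization.
# All numbers are invented for illustration.

p_action = {"safe": 0.9, "risky": 0.1}                 # behavior policy in the training data
p_success_given_action = {"safe": 0.5, "risky": 0.9}   # how often each action succeeds

# An ideal outcome-conditioned sequence model samples from P(action | success),
# i.e. Bayes' rule over the training distribution.
joint = {a: p_action[a] * p_success_given_action[a] for a in p_action}
total = sum(joint.values())
p_action_given_success = {a: p / total for a, p in joint.items()}

print(p_action_given_success)
# {'safe': ~0.83, 'risky': ~0.17}: mostly "safe", because that's what successful
# trajectories in the training data typically look like, even though "risky" is
# the action most likely to produce success:
print(max(p_success_given_action, key=p_success_given_action.get))  # 'risky'
```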
RL with KL penalties may also aim at a sort of calibration/conservatism, technically having a non-zero-entropy distribution as its optimal policy.
I apparently missed this relationship before. That’s interesting, and is directly relevant to one of the neuralese collapses I was thinking about.
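For concreteness, here is the standard form of that observation (my own notation, not quoted from the post): with reward $r$, KL coefficient $\beta$, and reference policy $\pi_{\mathrm{ref}}$, the KL-penalized objective

$$\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)$$

is exactly maximized by

$$\pi^{*}(x) \;\propto\; \pi_{\mathrm{ref}}(x)\,\exp\!\big(r(x)/\beta\big),$$

a softmax-tilted copy of the reference policy, which retains non-zero entropy wherever the reference does (for finite $\beta$).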
Sometimes it’s clear how GPT leaks evidence that it’s GPT, e.g. by getting into a loop.
Good point! That sort of thing does seem sufficient.
I have many thoughts about what an interpretable and controllable interface would look like, particularly for cyborgism (a rabbit hole I’m not going to go down in this comment), but I’m really glad you’ve come to the same question.
I look forward to reading it, should you end up publishing! It does seem like a load-bearing piece that I remain pretty uncertain about.
I do wonder if some of this could be pulled into the iterable engineering regime (in a way that’s conceivably relevant at scale). Ideally, there could be a dedicated experiment to judge human ability to catch and control models across different interfaces and problems. That mutual information paper seems like a good step here, and InstructGPT is sorta-kinda a datapoint. On the upside, most possible experiments of this shape seem pretty solidly on the ‘safety’ side of the balance.