It seems that we have independently converged on many of the same ideas. Writing is very hard for me and one of my greatest desires is to be scooped, which you’ve done with impressive coverage here, so thank you.
This is far from a full response to the post (that would be equivalent to actually writing some of the posts I’m procrastinating on), just some thoughts cached while reading it that can be written quickly.
I suspect the most critical difference between traditional RL and RL-via-predictors is that model-level goal agnosticism appears to be maintained in the latter.
Not unrelatedly, another critical difference might be the probabilistic calibration that you mention later on. A decision transformer conditioned on an outcome should still predict a probability distribution, and generate trajectories that are typical for the training distribution given the outcome occurs, which is not necessarily the sequence of actions that is optimally likely to result in the outcome. In other words, DTs should act with a degree of conservatism inversely related to the unlikelihood of the condition (conservatism and unlikelihood both relative to the distribution prior; update given by Bayes’ rule). This seems like quite a practical way to create goal-directed processes which, for instance, still respect “deontological” constraints such as corrigibility, especially because self-supervised pretraining seems to be a good way to bake nuanced deontological constraints into the model’s prior.
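To make the Bayes-rule picture concrete, here is a toy numerical sketch (all numbers invented purely for illustration, not drawn from any particular decision transformer): conditioning reweights the prior over trajectories by the likelihood of the outcome, which is not the same as picking whichever trajectory is most likely to cause the outcome.

```python
import numpy as np

# Toy sketch: three candidate trajectories, a prior over them from the
# training distribution, and the chance each one achieves the outcome.
# All numbers are made up for illustration.
prior = np.array([0.7, 0.2, 0.1])      # p(trajectory) under the pretraining prior
p_outcome = np.array([0.1, 0.5, 0.9])  # p(outcome | trajectory)

# Outcome-conditioned prediction samples from the Bayesian posterior:
#   p(trajectory | outcome) ∝ p(outcome | trajectory) * p(trajectory)
posterior = prior * p_outcome
posterior /= posterior.sum()
print(posterior)  # ≈ [0.27, 0.38, 0.35]

# The posterior mode is trajectory 1: decent at causing the outcome and
# reasonably typical under the prior. An argmax-style optimizer would pick
# trajectory 2, the prior-atypical one that is best at forcing the outcome.
print(np.argmax(posterior), np.argmax(p_outcome))  # 1 2
```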
RL with KL penalties can be thought of as also aiming at “Bayesian” conservatism, but, as I think you mentioned somewhere in the post, the dynamics of gradient descent and runtime conditioning probably pan out pretty differently, and I agree that RL is more likely to be brittle.
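For reference, the sense in which the KL-penalized objective targets a similar “Bayesian” object can be written out (this is the standard closed form for maximizing expected reward minus a KL penalty to the pretrained prior, with beta the penalty coefficient):

$$\pi^*(x) \;\propto\; \pi_0(x)\,\exp\!\left(r(x)/\beta\right)$$

i.e. a posterior whose likelihood term is the exponentiated reward, so the target distribution is itself conservative and non-degenerate. The question is whether gradient descent actually reaches that target in practice, versus getting the reweighting at runtime by conditioning.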
And of course a policy that can be retargeted towards various goals/outcomes without retraining seems to have many practical advantages over a fixed goal-directed “wrapper mind”, especially as a tool.
the simulator must leak information to the simulated agent that the agent can exploit … It’d be nice to have some empirical evidence here about the distance between a simulator’s ability to simulate a convincing “environment” and the abilities of its simulacra, but that seems hard to represent in smaller scale experiments.
I have some anecdotal evidence about this because curating GPT simulations until they become “situationally aware” is my hobby:
This happens surprisingly easily (not usually without curation; it’s hard to get anything complex and specific to happen reliably without curation due to multiverse divergence, but with surprisingly little curation). Sometimes it’s clear how GPT leaks evidence that it’s GPT, e.g. by getting into a loop.
I think, at least in the current regime, simulacra abilities will outpace the simulator’s ability to simulate a convincing environment. Situational awareness happens much more readily with larger models, and larger models still generate a lot of aberrations, especially if you let them run without curation/correction for a while.
So I’m pessimistic about alignment schemes which rely on powerful simulacra not realizing they’re in a simulation, and more pessimistic in proportion to the extent that the simulation is allowed to run autonomously.
Working at the level of molecular dynamics isn’t a good fit.
A big problem with molecular level simulations, aside from computational intractability, is that molecules are not a programmable interface for the emergent “laws of thought” we care about.
Is there something we can use to inform the choice? Are there constraints out there that would imply it needs to take a particular form, or obey certain bounds?
I’ve been thinking of this as a problem of interface design. Interface design is a ubiquitous problem and I think it has some isomorphisms to the natural abstractions agenda.
You want the interface—the exposed degrees of freedom—to approximate a Markov blanket over the salient aspects of the model’s future behavior, meaning the future can be optimally approximated given this limited set of variables. Of course, the interface also needs to be human-readable, and ideally human-writeable, allowing the human to not only predict but also control the model.
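One way to write down what “approximate a Markov blanket” might mean here (my gloss, not a standard definition): if $z$ is the interface read off from the full context $x$, and $y$ is the model’s future behavior, then we want

$$I(y;\, x \mid z) \approx 0$$

while keeping $z$ small and human-legible. The conditional mutual information condition says $z$ screens off the rest of the context from future behavior; the legibility condition is what makes it an interface rather than merely a sufficient statistic.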
Natural language prompts are not bad by this measure, all things considered, as language has been optimized for something similar, and GPT sims are quite programmable as a result. But language in the wild is not optimal. Many important latent variables for control & supervision are imperfectly entangled with observables. It’s hard to prompt GPT to reliably do some kinds of things or make some kinds of assumptions about the text it’s modeling.
InstructGPT can be viewed as an attempt to create an observable interface that is more usefully entangled with latents; i.e., the degrees of freedom are instructions which cause the model to literally follow those instructions. But instructions are not a good format for everything, e.g. conversation often isn’t the ideal interface.
I have many thoughts about what an interpretable and controllable interface would look like, particularly for cyborgism, a rabbit hole I’m not going to go down in this comment, but I’m really glad you’ve come to the same question.

Another potentially useful formalism which I haven’t thought much about yet is maximizing mutual information, which has actually been used as an objective function to learn interfaces by RL.
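As a sketch of how that kind of objective is usually operationalized (a generic variational construction, not a claim about any specific paper): to make an interface command $z$ maximally informative about the resulting behavior $\tau$, one maximizes a tractable lower bound on their mutual information,

$$I(z;\,\tau) \;\ge\; \mathbb{E}_{z,\tau}\!\left[\log q_\phi(z \mid \tau)\right] + \mathcal{H}(z)$$

where $q_\phi$ is a learned decoder that tries to recover the command from the behavior, and the bound is handed to the policy as reward. High mutual information is roughly the “writeable” half of the interface desideratum above: the exposed degrees of freedom actually determine what happens.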
It seems that we have independently converged on many of the same ideas. Writing is very hard for me and one of my greatest desires is to be scooped, which you’ve done with impressive coverage here, so thank you.
Thanks for writing the simulators post! That crystallized a lot of things I had been bouncing around.
A decision transformer conditioned on an outcome should still predict a probability distribution, and generate trajectories that are typical for the training distribution given the outcome occurs, which is not necessarily the sequence of actions that is optimally likely to result in the outcome.
That’s a good framing.
RL with KL penalties may also aim at a sort of calibration/conservatism, technically having a non-zero entropy distribution as its optimal policy
I apparently missed this relationship before. That’s interesting, and is directly relevant to one of the neuralese collapses I was thinking about.
Sometimes it’s clear how GPT leaks evidence that it’s GPT, e.g. by getting into a loop.
Good point! That sort of thing does seem sufficient.
I have many thoughts about what an interpretable and controllable interface would look like, particularly for cyborgism, a rabbit hole I’m not going to go down in this comment, but I’m really glad you’ve come to the same question.
I look forward to reading it, should you end up publishing! It does seem like a load-bearing piece that I remain pretty uncertain about.
I do wonder if some of this could be pulled into the iterable engineering regime (in a way that’s conceivably relevant at scale). Ideally, there could be a dedicated experiment to judge human ability to catch and control models across different interfaces and problems. That mutual information paper seems like a good step here, and InstructGPT is sorta-kinda a datapoint. On the upside, most possible experiments of this shape seem pretty solidly on the ‘safety’ side of the balance.