A theory of how alignment research should work
(cross-posted from danielfilan.com)
Epistemic status:
I listened to the Dwarkesh episode with Gwern and started attempting to think about life, the universe, and everything
less than an hour of thought has gone into this post
that said, it draws on a background of thinking for a while about how the field of AI alignment should relate to agent foundations research
Maybe obvious to everyone but me, or totally wrong (this doesn’t really grapple with the challenges of working in a domain where an intelligent being might be working against you), but:
we currently don’t know how to make super-smart computers that do our will
this is not just a problem of having a design that is not feasible to implement: we do not even have a sense of what the design would be
I’m trying to somewhat abstract over intent alignment vs control approaches, but am mostly thinking about intent alignment
I have not thought very much about societal/systemic risks, and this post doesn’t really address them.
ideally we would figure out how to do this
the closest traction that we have: deep learning seems to work well in practice, although our theoretical understanding of why it works so well, or of how capabilities are implemented, is lagging
how should we proceed? Well:
thinking about theory alone has not been a workable approach so far
probably we need to look at things that exhibit alignment-related phenomena and understand them, and that will help us develop the requisite theory
said things are probably neural networks
there are two ways we can look at neural networks: their behaviour, and their implementation.
looking at behaviour is conceptually straightforward, and valuable, and being done
how to look at their implementation is less obvious
what we need is tooling that lets us see relevant things about how neural networks are working
such tools (e.g. sparse autoencoders, or SAEs) are not impossible to create, but it is not obvious that their outputs tell us quantities that are actually of interest (a toy sketch of an SAE is given below)
in order to discipline the creation of such tools, we should demand that they help us understand models in ways that matter
see Stephen Casper’s engineer’s interpretability sequence, Jason Gross on compact proofs
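To make “such tools” concrete, here is a minimal sketch of a sparse autoencoder over model activations, written in PyTorch; the dimensions, the L1 coefficient, and the training loop are illustrative stand-ins rather than recommendations.

```python
# Minimal sparse autoencoder (SAE) sketch for decomposing model activations.
# All sizes and hyperparameters here are illustrative stand-ins.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature activations; sparsity comes from the L1 penalty below.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Trade off reconstruction fidelity against how many features fire.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity


# Usage sketch: in practice `activations` would be (say) residual-stream
# activations collected from the model under study; random data stands in here.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(8192, 512)

for step in range(100):
    reconstruction, features = sae(activations)
    loss = sae_loss(reconstruction, activations, features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that training something like this to a low loss does not, by itself, show that the learned features track anything alignment-relevant; that is the gap the surrounding points are about.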
once we get such tools, we should be trying to use them to understand alignment-relevant phenomena, to build up our theory of what we want out of alignment and how it might be implemented
this is also a thing that looking at the external behaviour of models in alignment-relevant contexts should be doing
so should we be just doing totally empirical things? No.
firstly, we need to be disciplined along the way: when we do our behavioural analysis and benchmark our interpretability tools, we should make sure that we are looking at settings that are in fact relevant to the alignment problem. This requires having a model of which situations are actually alignment-relevant, what problems we will face as models get smarter, etc.
secondly, once we have the building blocks for a theory, ideally we will put them together and prove some actual theorems, like “in such-and-such situations models will never become deceptive” (where ‘deceptive’ has been satisfactorily operationalized, in a way that suffices to derive good outcomes from no deception plus relatively benign humans)
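As a purely illustrative schema (not a proposed operationalization), such a theorem might have roughly the following shape, where the model class, the training-condition predicate, the deception predicate, and the bound are all placeholders:

```latex
% Illustrative shape only: \mathcal{M}, C, \mathrm{Deceptive}, \mathcal{D}, and
% \varepsilon are placeholders for operationalizations we do not yet have.
\[
  \forall M \in \mathcal{M}:\quad
  C\big(\mathrm{train}(M)\big) \;\Longrightarrow\;
  \Pr_{x \sim \mathcal{D}}\big[\mathrm{Deceptive}(M, x)\big] \le \varepsilon
\]
```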
I’m imagining the above as analogous to a stylized history of statistical mechanics (people who know this history, or who have read “Inventing Temperature”, should let me know if I’m totally wrong about it):
first we have steam engines etc
then we figure out that ‘temperature’ and ‘entropy’ are relevant things to track for making the engines run
then we relate temperature, entropy, and pressure
then we get a good theory of thermodynamics
then we develop statistical mechanics
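For reference, the relations that this progression eventually delivered, in modern notation (standard results, included only to make the analogy concrete):

```latex
% The thermodynamic identity relating energy, temperature, entropy,
% pressure, and volume, and Boltzmann's statistical account of entropy
% in terms of the number of microstates W:
\[ \mathrm{d}U = T\,\mathrm{d}S - p\,\mathrm{d}V \qquad\qquad S = k_B \ln W \]
```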
exceptions to “theory without empiricism doesn’t work”:
thinking about deceptive mesa-optimization
RLHF failures
CIRL analysis
lesson of above: theory does seem to help us analyze some issues and raise possibilities
I agree that we probably want most theory to be towards the applied end these days, due to short timelines. Empirical work needs theory in order to direct it; theory needs empirics in order to remain grounded.
Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against: “looking at behaviour is conceptually straightforward, and valuable, and being done”.
I agree with Apollo Research that evals aren’t really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models using different schemes and comparing the results.
Similarly, a lot of work with model organisms of misalignment requires careful thought to get right.
When I wrote that, I wasn’t thinking so much about evals / model organisms as stuff like:
putting a bunch of agents in a simulated world and seeing how they interact
weak-to-strong / easy-to-hard generalization
basically stuff along the lines of “when you put agents in X situation, they tend to do Y thing”, rather than trying to understand latent causes / capabilities (a toy version of the weak-to-strong setup is sketched below)
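For concreteness, a toy version of the weak-to-strong setup might look like the sketch below; the data, model sizes, and training budget are all illustrative stand-ins, and the point is only to show the shape of the experiment (train a weak model on true labels, supervise a strong model with the weak model’s labels, and compare against the strong model trained on true labels).

```python
# Toy weak-to-strong generalization sketch: a small "weak" model's imperfect
# labels supervise a larger "strong" model, and we compare against the strong
# model trained on ground truth. Data, sizes, and budgets are stand-ins.
import torch
import torch.nn as nn


def make_data(n: int = 2000, d: int = 20, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, d, generator=g)
    w = torch.randn(d, generator=g)
    y = (x @ w + 0.5 * torch.sin(3 * x[:, 0]) > 0).float()  # mildly nonlinear target
    return x, y


def mlp(d_in: int, width: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))


def train(model: nn.Module, x: torch.Tensor, y: torch.Tensor, steps: int = 500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x).squeeze(-1), y).backward()
        opt.step()
    return model


def accuracy(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    with torch.no_grad():
        return ((model(x).squeeze(-1) > 0).float() == y).float().mean().item()


x_train, y_train = make_data(seed=0)
x_test, y_test = make_data(seed=1)

weak = train(mlp(20, 4), x_train, y_train)                  # small model, true labels
with torch.no_grad():
    weak_labels = (weak(x_train).squeeze(-1) > 0).float()   # its imperfect labels

strong_on_weak = train(mlp(20, 256), x_train, weak_labels)  # strong model, weak labels
strong_ceiling = train(mlp(20, 256), x_train, y_train)      # strong model, true labels

print("weak accuracy:", accuracy(weak, x_test, y_test))
print("strong trained on weak labels:", accuracy(strong_on_weak, x_test, y_test))
print("strong trained on true labels:", accuracy(strong_ceiling, x_test, y_test))
```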
I think the key missing piece you’re pointing at (making sure that our interpretability tools etc. actually tell us something alignment-relevant) is one of the big things going on in model organisms of misalignment (IIRC there’s a step that’s like ‘OK, but if we do interpretability/control/etc. on the model organism, does that help?’). Ideally this type of work, or something close to it, could become more common, provide ‘evals for our evals’, and expand in scope and application beyond deep deception.
If that happened, it seems like it would fit the bill here.
Does that seem true to you?
Yeah, that seems right to me.
Oh except: I did not necessarily mean to claim that any of the things I mentioned were missing from the alignment research scene, or that they were present.