Neuroscience and Alignment
I’ve been in many conversations where I’ve mentioned the idea of using neuroscience for outer alignment, and the people I’m talking to usually seem pretty confused about why I would want to do that. Well, I’m confused about why one wouldn’t want to do that, and in this post I explain why.
As I see it, there are three main strategies people have for trying to deal with AI alignment in worlds where AI alignment is hard.
Value alignment
Corrigibility
Control/scalable alignment
In my opinion, these are all great efforts, but I personally like the idea of working on value alignment directly. Why? First, some negatives of the others:
Corrigibility requires moderate to extreme levels of philosophical deconfusion, an effort worth pursuing for some, but a very small set that doesn’t include me. Another negative of this approach is that, by default, the robust solutions to its problems won’t be easily implementable in deep learning.
Control/scalable alignment requires understanding the capabilities & behaviors of inherently unpredictable systems. Sounds hard![1]
Why is value alignment different from these? Because we have a working example of a value-aligned system right in front of us: the human brain. This permits an entirely scientific approach, requiring minimal philosophical deconfusion. And in contrast to corrigibility solutions, biological and artificial neural networks are based upon the same fundamental principles, so there’s a much greater chance that insights from one will transfer to the other.
In the most perfect world, we would never touch corrigibility or control with a ten-foot pole; instead, once we realized the vast benefits and potential pitfalls of AGI, we’d get to work on decoding human values (or, more likely, the generators of human values) directly from the source.
Indeed, in worlds where control or scalable alignment go well, I expect the research area our AI minions will most prioritize is neuroscience. The AIs will likely be too dumb or have the wrong inductive biases to hold an entire human morality in their head, and even if they do, we don’t know whether they do, so we need them to demonstrate that their values are the same as our values in a way which can’t be gamed by exploiting our many biases or philosophical inadequacies. The best way to do that is through empiricism, directly studying & making predictions about the thing you’re trying to explain.
The thing is, we don’t need to wait until potentially transformative AGI in order to start doing that research; we can do it now! And we can even use presently existing AIs to help!
I am hopeful there are in fact clean values, or generators of values, in our brains, such that we could understand just those mechanisms and not the others. In worlds where this is not the case, I get more pessimistic about our chances of ever aligning AIs, because in those worlds all computations in the brain are necessary to do a “human morality”. That means that if you do, say, RLHF or DPO on your model and hope it ends up aligned afterwards, it will not be aligned, because it is not literally simulating an entire human brain. It’s doing less than that, and so it must be missing some necessary computation.
Put another way, worlds where you need to understand the entire human brain to understand human morality are often also worlds where human morality is incredibly complex, so value learning approaches are less likely to succeed, and the only aligned AIs are those which are digital emulations of human brains. Thus again, neuroscience is even more necessary.
Thanks to @Jozdien for comments
- ^
I usually see people say “we do control so we can do scalable alignment”, where scalable alignment means taking a small model and having it align a larger model, and figuring out procedures such that the larger model can only get more aligned than the smaller model. This IMO has very similar problems to control, so I lump the strategies and criticisms together.
There are two different hypotheses I feel like this is equivocating between: that there are values in our brain represented in a clean enough fashion that you can identify them by looking at the brain bottom-up; and that there are values in our brain that can be separated out from other things that aren’t necessary to “human morality”, but that doing so requires answering some questions about how to separate them. Descriptiveness vs prescriptiveness. From a mech interp standpoint, this is one of my core disagreements with the mech interp people: they think we can identify values or other high-level concepts like deception simply by looking at the model’s linear representations bottom-up, whereas I think that will be a highly non-trivial problem.
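To make “looking at the model’s linear representations bottom-up” concrete, here’s a minimal sketch of the kind of procedure I have in mind: train a sparse autoencoder on cached model activations and treat its learned dictionary directions as candidate linear features. Everything here (shapes, hyperparameters, the assumption that interesting concepts show up as single directions) is an illustrative assumption, not anyone’s actual setup.

```python
# Illustrative sparse autoencoder over cached activations; all shapes and
# hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)             # reconstruct activations from features
        return recon, codes

def train_step(sae, acts, opt, l1_coeff=1e-3):
    recon, codes = sae(acts)
    # reconstruction error + L1 sparsity penalty on the feature activations
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The disagreement is about the step after this: nothing in the procedure tells you which (if any) of the learned directions correspond to “values” or “deception”, and I expect that identification step to carry most of the difficulty.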
I say it’s fine if you do top-down research in neuroscience. That’s currently an underrated field of research there. For example, see the HTM school.
Yeah, I suppose then for me that class of research ends up looking pretty similar to solving the hard conceptual problem directly? Like, in that the problem in both cases is “understand what you want to interface with in this system in front of you that has property X you want to interface with”. I can see an argument for this being easier with ML systems than brains because of their operability.
This is true, and it’s why I’m working with ML systems. But learning about neuroscience, taking opportunities to do neuroscience work when it fits, and trying to generalize your findings to neuroscience should be a big goal of your work. I know it is for mine.
Yeah, that’s fair. Two minor points then:
First is that I don’t really expect us to come up with a fully general answer to this problem in time. I wouldn’t be surprised if we had to trade off some generality for indexing on the system in front of us—this gets us some degree of non-robustness, but hopefully enough to buy us a lot more time before stuff like the problem behind deep deception breaks a lack of True Names. Hopefully then we can get the AI systems to solve the harder problem for us in the time we’ve bought, with systems more powerful than us. The relevance here is that if this is the case, then trying to generalize our findings to an entirely non-ML setting, while definitely something we want, might not be something we get, and maybe it makes sense to index lightly on a particular paradigm if the general problem seems really hard.
Second is that this claim as stated in your reply seems like a softer one (that I agree with more) than the one made in the overall post.
The answer does not need to be fully general; it just needs to be general across ML systems and human brains. In worlds where this is hard, where the values AIs have are so fundamentally different from the values humans have that there is no corresponding values-like concept in human value space onto which the AI’s values can be mapped without routing through a utility function, your AI is Definitely For Sure Not Aligned with you. Maybe it’s corrigible, but I don’t think that’s the situation you have in mind.
My claim is that neuroscience turns value alignment into a technical problem, and therefore some neuroscience research is useful. I don’t discuss the policy implications of this observation.
If I had access to a neuroscience or tech lab (and the relevant skills), I’d be doing that rather than ML.
When I think of how I approach solving this problem in practice, I think of interfacing with structures within ML systems that satisfy an increasing list of desiderata for values, covering the rest with standard mech interp techniques, and then steering them with human preferences. I certainly think it’s probable that there are valuable insights for this process from neuroscience, but I don’t think a good solution to this problem (under the constraints I mention above) requires that it be general to the human brain as well. We steer the system with our preferences (and interfacing with internal objectives seems to avoid the usual problems with preferences). While something that allows us to actually directly translate our values from our system to theirs would be great, I expect that constraint of generality to make the problem harder than necessary.
I think it’s a technical problem too. Maybe you mean we don’t have to answer weird philosophical problems? But then I would say that we still have to answer the question of what we want to interface with / how to do top-down neuroscience, which I would call a technical problem but which routes through things you may consider philosophical problems (of the nature “what do we want to interface with / what do we mean by values in this system?”).
I think the seeds of an interdisciplinary agenda on this are already there, see e.g. https://manifund.org/projects/activation-vector-steering-with-bci, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=WLCcQS5Jc7NNDqWi5, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=D6NCcYF7Na5bpF5h5, https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp and maybe my other comments on this post.
I might have a shortform going into more detail on this soon, or at least by the time of https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/.
I guess I’m just like: no matter what you do, you’re going to need to translate human values into AI values. The methodology you’re proposing is some kind of steering thing, where you have knobs, I’m assuming, which you can turn to emphasize or de-emphasize certain values inside your ML system. And there’s a particular setting of these knobs which gets you human values, and your job is to figure out what that setting of the knobs is.
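To make sure we’re picturing the same thing, here is a minimal sketch of what I mean by “knobs”: add a scaled direction to a hidden layer’s activations at inference time and vary the coefficient. The layer access pattern, and the idea that a single direction captures a value, are assumptions for illustration, not a claim about your actual setup.

```python
# Illustrative activation-steering "knob": nudge a layer's output along a
# chosen direction by a tunable coefficient. `layer` and `direction` are
# placeholders; how you obtain a "value direction" is the hard part.
import torch

def steering_hook(direction: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction  # turn the knob up or down
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def add_knob(layer: torch.nn.Module, direction: torch.Tensor, coeff: float):
    # e.g. layer = model.transformer.h[12] for a GPT-style model (assumed layout)
    return layer.register_forward_hook(steering_hook(direction, coeff))
```

“Figuring out the setting of the knobs” then corresponds to searching over directions and coefficients until the model’s expressed preferences match the ones you want.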
I think this works fine for worlds where alignment is pretty easy. This sounds a lot like Alex Turner’s current plan, but I don’t think it works well in worlds where alignment is hard. In worlds where alignment is hard, it’s not necessarily guaranteed that the AI will even have values which are close to your own values, and you may need to do some interventions into your AI or rethink how you train your AI so that it has values which are similar to yours.
It sounds to me like you should seriously consider doing work which looks like, e.g., https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms, Getting aligned on representational alignment, or Training language models to summarize narratives improves brain alignment; also see a lot of recent work from Ilia Sucholutsky and from this workshop.
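For a concrete sense of the kind of measurement behind that line of work, here is a minimal representational similarity analysis (RSA) sketch: compare the pairwise dissimilarity structure of model activations with that of neural recordings for the same stimuli. The arrays below are random placeholders, not any particular model or dataset.

```python
# Minimal RSA sketch: correlate the dissimilarity structure of model
# activations with that of (placeholder) brain recordings.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix (condensed): 1 - correlation between stimuli."""
    return pdist(responses, metric="correlation")

def rsa_score(model_acts: np.ndarray, brain_data: np.ndarray) -> float:
    """Spearman correlation between the two RDMs; higher = more representationally aligned."""
    rho, _ = spearmanr(rdm(model_acts), rdm(brain_data))
    return rho

# Random placeholders: 50 stimuli, 768-dim model activations, 200 voxels.
model_acts = np.random.randn(50, 768)
brain_data = np.random.randn(50, 200)
print(rsa_score(model_acts, brain_data))
```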
yes, e.g. https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv
I’m not sure anyone I know in mech interp is claiming this is a trivial problem.
Yeah sorry I should have been more precise. I think it’s so non-trivial that it plausibly contains most of the difficulty in the overall problem—which is a statement I think many people working on mechanistic interpretability would disagree with.
I think you meant “human morality”
I did! You are the second person to point that out, which means it must be a particularly egregious spelling error. Corrected!
The similarities go even deeper, I’d say; see e.g. The neuroconnectionist research programme for a review, quite a few of my past linkposts (e.g. on representational alignment and how it could be helpful for value alignment, on evidence of [by default] (some) representational alignment between LLMs and humans, etc.), and https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms.
I’ve had related thoughts (and still do, though it’s become more of a secondary research agenda); might be interesting to chat (more) during https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/.
The brain algorithms that do moral reasoning are value-aligned in the same way a puddle is aligned with the shape of the hole it’s in.
They’re shaped by all sorts of forces, ranging from social environment to biological facts like how we can’t make our brains twice as large. Not just during development, but on an ongoing basis our moral reasoning exists in a balance with all these other forces. But of course, a puddle always coincidentally finds itself in a hole that’s perfectly shaped for it.
If you took the decision-making algorithms from my brain and put them into a brain 357x larger, that tautological magic spell might break, and the puddle that you’ve moved into a different hole might no longer be the same shape as it was in the original hole.
If you anticipate this general class of problems and try to resolve them, that’s great! I’m not saying nobody should do neuroscience. It’s just that I don’t think it’s an “entirely scientific approach, requiring minimal philosophical deconfusion,” nor does it lead to safe AIs that are just emulations of humans except smarter.
I think that even if something is lost in that galaxy-brain, for most people it would be hard to lose the drives not to kill your friends and family for meaningless squiggles. But maybe this is the case. Either way, I don’t think you need radical philosophical de-confusion in order to solve this problem. You just need to understand what does and doesn’t determine the values you have. What you describe is determining the boundary conditions for a complicated process, not figuring out what “justice” is. It has the potential to be a hard technical problem, but technical it still is.
A clarification about the sense in which I claim “biological and artificial neural networks are based upon the same fundamental principles”:
I would not be surprised if the reasons why neural networks “work” are also exploited by the brain.
In particular, the reason I think neuroscience for value alignment is promising is that we can expect the values part of the brain to be compatible with these principles, and so it won’t require too many extra fundamental advances to actually implement. Contrast this with, say, corrigibility, which will first be worked out for ideal utility maximizers and then require a mapping from that setting to neural networks, which seems potentially just as hard as writing an AGI from scratch.
In the case where human values are incompatible with artificial neural networks, again I get much more pessimistic about all alternative forms of value alignment of neural networks.
I’m confused by this statement. Do we know this? Do we have enough of an understanding of either to say this? Don’t get me wrong, there’s some level on which I totally buy this. However, I’m just highly uncertain about what is really being claimed here.
Does this comment I wrote clear up my claim?
It helps a little but I feel like we’re operating at too high a level of abstraction.
I don’t think this is necessarily true. I don’t think emulated human brains are necessary for full alignment, nor am I sure emulated human brains would be more aligned than a well-calibrated and scaled-up version of our current alignment techniques (plus new ones to be discovered in the next few years). Emulating the entire human brain to align values seems not only implausible (even with neuromorphic computing, efficient neural networks, and Moore’s law^1000), it seems like overkill and a misallocation of valuable computational resources. Assuming I’m understanding “emulated human brains” correctly, emulation would mean pseudo-sentient systems solely designed to be aligned to our values. Perhaps morality can be a bit simpler than that, somewhere in the middle between static, written rules (the law) and the unpredictable human mind. Because if we’re essentially just making more people, it’s not really addressing our “many biases or philosophical inadequacies”.
This conclusion was the result of conditioning on the world where, in order to decode human values from the brain, we need to understand the entire brain. I agree with you when this is not the case, but to different degrees depending on how much of the brain must be decoded.