I am hopeful there are in fact clean values or generators of values in our brains, such that we could just understand those mechanisms and not other mechanisms. In worlds where this is not the case, I get more pessimistic about our chances of ever aligning AIs, because in those worlds all computations in the brain are necessary to do a “human morality”, which means that if you try to do, say, RLHF or DPO to your model and hope that it ends up aligned afterwards, it will not be aligned, because it is not literally simulating an entire human brain. It’s doing less than that, and so there must be some necessary computation it’s missing.
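For concreteness, here is a minimal sketch of the kind of update DPO applies, using made-up log-probabilities in place of a real model; the relevant point is that the update is driven entirely by pairwise preference data, not by anything resembling a simulation of a brain.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed log-probabilities of a response
    under the policy or a frozen reference model. The loss pushes the
    policy to prefer the 'chosen' response over the 'rejected' one,
    relative to the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of 4 preference pairs.
policy_chosen = torch.tensor([-12.3, -10.1, -15.0, -9.8], requires_grad=True)
policy_rejected = torch.tensor([-11.9, -13.4, -14.2, -12.0], requires_grad=True)
ref_chosen = torch.tensor([-12.5, -10.4, -15.1, -10.0])
ref_rejected = torch.tensor([-11.5, -13.0, -14.5, -11.8])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(float(loss))
```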
There are two different hypotheses I feel like this is equivocating between: that there are values in our brain represented in a clean enough fashion that you can identify them by looking at the brain bottom-up; and that there are values in our brain that can be separated out from other things that aren’t necessary to “human morality”, but where doing so requires answering some questions about how to separate them. Descriptive vs. prescriptive, in other words. From a mech interp standpoint, this is one of my core disagreements with the mech interp people: they think we can identify values or other high-level concepts like deception simply by looking at the model’s linear representations bottom-up, where I think that will be a highly non-trivial problem.
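To illustrate what bottom-up identification from linear representations looks like in practice, here is a minimal linear-probe sketch on synthetic stand-in activations; high probe accuracy shows a concept is linearly decodable, but by itself says nothing about whether that direction is the model’s “value” rather than a correlated feature, which is where the prescriptive question enters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for hidden activations extracted from a model at some layer,
# with binary labels for whether the input exemplified some concept
# (e.g. "deceptive" vs "honest" statements). Here the data is synthetic:
# a random unit "concept direction" added to the activations of one class.
rng = np.random.default_rng(0)
d_model, n = 512, 2000
concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)
labels = rng.integers(0, 2, size=n)
activations = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * concept_direction

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)

# A linear probe: if it reaches high accuracy, the concept is linearly
# decodable from the activations -- which by itself does not establish
# that the model "uses" this direction, or that it corresponds to a value.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```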
I say it’s fine if you do top-down research in neuroscience. That’s currently an underrated area of research there. For example, see the HTM school.
Yeah, I suppose then for me that class of research ends up looking pretty similar to solving the hard conceptual problem directly? Like, in that the problem in both cases is “understand what you want to interface with in the system in front of you that has property X”. I can see an argument for this being easier with ML systems than brains because of their operability.
This is true, and it’s why I’m working with ML systems. But learning about neuroscience, taking opportunities to do neuroscience work when it fits, and trying to generalize your findings to neuroscience should be a big goal of your work. I know it is for mine.
Yeah, that’s fair. Two minor points then:
First is that I don’t really expect us to come up with a fully general answer to this problem in time. I wouldn’t be surprised if we had to trade off some generality for indexing on the system in front of us. That gets us some degree of non-robustness, but hopefully the result is robust enough to buy us a lot more time before something like the problem behind deep deception breaks approaches that lack True Names. Hopefully we can then get AI systems more powerful than us to solve the harder problem in the time we’ve bought. The relevance here is that if this is the case, then generalizing our findings to an entirely non-ML setting, while definitely something we want, might not be something we get, and maybe it makes sense to index lightly on a particular paradigm if the general problem seems really hard.
Second is that this claim as stated in your reply seems like a softer one (that I agree with more) than the one made in the overall post.
The answer does not need to be fully general; it just needs to be general across ML systems and human brains. In worlds where this is hard (where the values AIs have are fundamentally different from the values humans have, such that there is no corresponding values-like concept in human value space onto which the AI’s values can be mapped without routing through a utility function), your AI is Definitely For Sure Not Aligned with you. Maybe it’s corrigible, but I don’t think that’s the situation you have in mind.
My claim is that neuroscience turns value alignment into a technical problem, and therefore some neuroscience research is useful. I don’t discuss the policy implications of this observation.
If I had access to a neuroscience or tech lab (and the relevant skills), I’d be doing that rather than ML.
When I think of how I would approach solving this problem in practice, I think of interfacing with structures within ML systems that satisfy an increasing list of desiderata for values, covering the rest with standard mech interp techniques, and then steering them with human preferences. I certainly think it’s probable that neuroscience has valuable insights for this process, but I don’t think a good solution to this problem (under the constraints I mention above) requires that it be general to the human brain as well. We steer the system with our preferences (and interfacing with internal objectives seems to avoid the usual problems with preferences). While something that allows us to directly translate our values from our system to theirs would be great, I expect that constraint of generality to make the problem harder than necessary.
I think that it’s a technical problem too. Maybe you mean we don’t have to answer weird philosophical problems? But then I would say that we still have to answer the question of what we want to interface with / how to do top-down neuroscience, which I would call a technical problem, but which I think routes through things you may consider philosophical problems (of the form “what do we want to interface with / what do we mean by values in this system?”).
I think the seeds of an interdisciplinary agenda on this are already there; see e.g. https://manifund.org/projects/activation-vector-steering-with-bci, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=WLCcQS5Jc7NNDqWi5, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=D6NCcYF7Na5bpF5h5, https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp, and maybe my other comments on this post.
I might have a shortform going into more detail on this soon, or at least by the time of https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/.
I guess I’m just like: no matter what you do, you’re going to need to translate human values into AI values. The methodology you’re proposing is some kind of steering approach, where, I’m assuming, you have knobs you can turn to emphasize or de-emphasize certain values inside your ML system. And there’s a particular setting of those knobs which gets you human values, and your job is to figure out what that setting of the knobs is.
I think this works fine in worlds where alignment is pretty easy. It sounds a lot like Alex Turner’s current plan, but I don’t think it works well in worlds where alignment is hard. In those worlds, it’s not guaranteed that the AI will even have values close to your own, and you may need to intervene on your AI, or rethink how you train it, so that it ends up with values similar to yours.
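To make the “knobs” picture concrete, here is a minimal activation-steering sketch on a toy network; the “value direction” and coefficient are hypothetical stand-ins for what interpretability work would have to supply. Turning the coefficient up or down is the knob, and the disagreement above is about whether any setting of such knobs gets you human values in the hard worlds.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's hidden states; in practice the
# hook would be registered on a real model's residual stream.
d_model = 64
toy_model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

# Hypothetical "value direction" found by interpretability work, and a
# coefficient acting as the knob that emphasizes or de-emphasizes it.
value_direction = torch.randn(d_model)
value_direction = value_direction / value_direction.norm()
steering_coefficient = 4.0  # the "knob"

def steer(module, inputs, output):
    # Add the scaled direction to this layer's output activations.
    return output + steering_coefficient * value_direction

handle = toy_model[0].register_forward_hook(steer)

x = torch.randn(8, d_model)
steered_out = toy_model(x)
handle.remove()
unsteered_out = toy_model(x)
print("mean shift in output:", (steered_out - unsteered_out).abs().mean().item())
```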
It sounds to me like you should seriously consider doing work that might look like e.g. https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms, “Getting aligned on representational alignment”, and “Training language models to summarize narratives improves brain alignment”; also see many recent works from Ilia Sucholutsky and from this workshop.
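For a sense of what representational-alignment work measures, here is a minimal representational similarity analysis sketch, with random arrays standing in for model embeddings and human (neural or behavioral) data over the same stimuli.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Stand-ins: model embeddings and human (e.g. fMRI- or judgment-derived)
# representations for the same 50 stimuli. In real work these would come
# from an LM's hidden states and a neuroimaging or behavioral dataset.
rng = np.random.default_rng(0)
n_stimuli = 50
model_reps = rng.normal(size=(n_stimuli, 768))
human_reps = model_reps[:, :128] + rng.normal(scale=0.5, size=(n_stimuli, 128))

# Representational dissimilarity matrices (condensed form), one per system.
model_rdm = pdist(model_reps, metric="correlation")
human_rdm = pdist(human_reps, metric="correlation")

# Representational similarity analysis: rank-correlate the two RDMs.
rho, _ = spearmanr(model_rdm, human_rdm)
print("representational alignment (Spearman rho):", round(rho, 3))
```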
Yes, e.g. https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv
I’m not sure anyone I know in mech interp is claiming this is a trivial problem.
Yeah, sorry, I should have been more precise. I think it’s so non-trivial that it plausibly contains most of the difficulty in the overall problem, which is a statement I think many people working on mechanistic interpretability would disagree with.