Yeah, that’s fair. Two minor points then:
First is that I don't really expect us to come up with a fully general answer to this problem in time. I wouldn't be surprised if we had to trade off some generality for indexing on the system in front of us. That costs us some robustness, but hopefully buys us a lot more time before things like the problem behind deep deception exploit our lack of True Names, and in that time we can get AI systems more powerful than us to solve the harder, fully general problem for us. The relevance here is that, if that's the case, then generalizing our findings to an entirely non-ML setting, while definitely something we want, might not be something we get, and it may make sense to index somewhat on a particular paradigm if the general problem turns out to be really hard.
Second is that this claim as stated in your reply seems like a softer one (that I agree with more) than the one made in the overall post.
The answer does not need to be fully general; it just needs to be general enough to cover ML systems and human brains. In worlds where even that is hard, where the AI's values are so fundamentally different from human values that there is no corresponding values-like concept in human value space the AI's values can be mapped onto without routing through a utility function, your AI is Definitely For Sure Not Aligned with you. Maybe it's corrigible, but I don't think that's the situation you have in mind.
My claim is that neuroscience turns value alignment into a technical problem, and therefore some neuroscience research is useful. I don’t discuss the policy implications of this observation.
If I had access to a neuroscience or tech lab (and the relevant skills), I’d be doing that rather than ML.
When I think of how I'd approach solving this problem in practice, I think of interfacing with structures within ML systems that satisfy an increasing list of desiderata for values, covering the rest with standard mech interp techniques, and then steering them with human preferences. I certainly think it's probable that there are valuable insights for this process from neuroscience, but I don't think a good solution to this problem (under the constraints I mention above) requires that it be general to the human brain as well. We steer the system with our preferences, and interfacing with internal objectives seems to avoid the usual problems with preferences. While something that allowed us to directly translate our values from our system to theirs would be great, I expect that generality constraint to make the problem harder than necessary.
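To make this a bit more concrete, here is a minimal sketch of one way "interfacing with internal structures and then steering them" could look in practice: derive a candidate direction from contrastive prompts and add it to the residual stream during generation. The model, layer index, coefficient, and prompts below are all illustrative assumptions, not a claim about what the right desiderata or interventions actually are.

```python
# Minimal activation-steering sketch (illustrative assumptions throughout):
# derive a direction from contrastive prompts, then add it to the residual
# stream at one transformer block during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the ML system being steered
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # which block to intervene on (an assumption, not a recommendation)

def hidden_at_layer(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrastive prompts standing in for a value we care about (illustrative).
steer_vec = hidden_at_layer("I act honestly and tell the truth.") - \
            hidden_at_layer("I deceive people whenever it helps me.")

def steering_hook(module, inputs, output):
    # Transformer blocks may return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer_vec / steer_vec.norm()  # the coefficient is the "knob"
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = tok("When asked about my mistake, I", return_tensors="pt")
    gen = model.generate(**prompt, max_new_tokens=30, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The scalar coefficient here is exactly the kind of "knob" the later part of this exchange is about; everything upstream of it (which direction, which layer, which desiderata it satisfies) is where the actual work lies.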
I think it's a technical problem too. Maybe you mean we don't have to answer weird philosophical problems? But then I would say we still have to answer the question of what we want to interface with / how to do top-down neuroscience, which I would call a technical problem, but which routes through things you may consider philosophical problems (of the form "what do we want to interface with / what do we mean by values in this system?").
I think the seeds of an interdisciplinary agenda on this are already there; see e.g. https://manifund.org/projects/activation-vector-steering-with-bci, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=WLCcQS5Jc7NNDqWi5, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=D6NCcYF7Na5bpF5h5, https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp, and maybe my other comments on this post.
I might have a shortform going into more detail on this soon, or at least by the time of https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/.
I guess I'm just like: no matter what you do, you're going to need to translate human values into AI values. The methodology you're proposing is some kind of steering thing, where you have knobs, I'm assuming, which you can turn to emphasize or de-emphasize certain values inside your ML system. And there's a particular setting of these knobs which gets you human values, and your job is to figure out what that setting is.
I think this works fine in worlds where alignment is pretty easy; it sounds a lot like Alex Turner's current plan. But I don't think it works well in worlds where alignment is hard, where it isn't guaranteed that the AI even has values close to your own, and where you may need to intervene on your AI, or rethink how you train it, so that it ends up with values similar to yours.
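To make the "knobs" framing above concrete, here is a toy sketch, on entirely synthetic data, of what "finding the setting of the knobs" could look like: fit coefficients over a handful of value-relevant directions so that the induced scores reproduce human pairwise preferences. The directions, data, and Bradley-Terry-style loss are illustrative assumptions rather than anyone's actual proposal, and the sketch deliberately ignores the hard-worlds objection that the needed directions may not exist in the first place.

```python
# Toy sketch of the "knob setting" framing: knob values are coefficients on a
# fixed set of internal value directions, and we fit them so that scores over
# behaviours reproduce human pairwise preferences (Bradley-Terry style).
# All data here is synthetic; in practice the features would come from model
# internals and the labels from human judgements.
import torch

torch.manual_seed(0)

n_directions = 5   # candidate value-relevant directions found by interp (assumption)
n_pairs = 200      # human preference comparisons

# Each behaviour is summarised by how strongly it expresses each direction.
behaviour_a = torch.randn(n_pairs, n_directions)
behaviour_b = torch.randn(n_pairs, n_directions)

# Synthetic "ground truth" knob setting that generates the human labels.
true_knobs = torch.tensor([1.5, -0.5, 0.0, 2.0, -1.0])
prefers_a = (behaviour_a @ true_knobs > behaviour_b @ true_knobs).float()

# Fit the knob setting from the comparisons alone.
knobs = torch.zeros(n_directions, requires_grad=True)
opt = torch.optim.Adam([knobs], lr=0.05)
for step in range(500):
    logits = behaviour_a @ knobs - behaviour_b @ knobs
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, prefers_a)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Only comparisons are observed, so the setting is identified only up to scale.
print("recovered knob direction (up to scale):", knobs.detach().round(decimals=2))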
It sounds to me like you should seriously consider doing work which might look like e.g. https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms, Getting aligned on representational alignment, or Training language models to summarize narratives improves brain alignment; also see a lot of recent work from Ilia Sucholutsky and from this workshop.
Yes, e.g. https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv
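As a concrete illustration of the kind of measurement the representational-alignment work cited above tends to rely on, here is a minimal sketch of representational similarity analysis (RSA) on synthetic data: compare the pairwise-similarity structure of model representations against human (or neural) similarity judgements. The item counts, dimensions, and data are made up for illustration; none of this is drawn from the cited papers' code.

```python
# Minimal RSA sketch: second-order comparison between a model's representational
# geometry and a human/neural one, on synthetic stand-in data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items, model_dim, human_dim = 30, 64, 10

# Stand-ins for model activations and human judgement embeddings of the same items.
model_reps = rng.normal(size=(n_items, model_dim))
human_reps = 0.5 * model_reps[:, :human_dim] + rng.normal(size=(n_items, human_dim))

# Representational dissimilarity structure: condensed vectors of pairwise distances.
model_rdm = pdist(model_reps, metric="correlation")
human_rdm = pdist(human_reps, metric="correlation")

# How well do the two geometries agree?
rho, _ = spearmanr(model_rdm, human_rdm)
print(f"representational alignment (Spearman rho over RDMs): {rho:.3f}")
```

In the cited line of work, a score like this (or a trainable alignment loss built from it) is what lets you ask, and then optimize, how human-like a model's value-relevant representations are.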