[Note: I believe that this proposal is primarily talking about Value-Alignment Target Analysis (VATA), not Intent-Alignment Target Analysis. I think this distinction is important, so I will emphasize it.]
I’m a believer in the wisdom of aiming for intent-aligned corrigible agents (CAST) rather than aiming directly for a value-aligned agent or value-aligned sovereign AI. I agree with you that current proposals around Value-Alignment targets seem likely to be disastrous.
Questions:
Will we be able to successfully develop corrigible agents in time to be strategically relevant? Or will more risk-tolerant, less thoughtful people race to powerful AI first and cause disaster?
What could we humble researchers do about this?
Would focusing on Value-Alignment Target Analysis help?
Will current leading labs correctly adopt and implement the recommendations made by Alignment Target Analysis?
If we do have corrigible agents, plus an international coalition to prevent rogue AGI and self-replicating weapons, with mutual inspections to prevent any member from secretly defecting and racing toward more powerful AI...
How long do we expect that wary peace to hold?
Will we succeed at pausing for only a year, or for more like ten years? My current guess is that something like ten years may be feasible.
Thoughts:
My current expectation is that we should plan for what to do in the next 2-3 critical years, and prioritize whatever seems most likely to get us to a stable AI-progress-delay situation.
I am unconvinced that the key decision makers will be tempted to launch a powerful, value-locking-in sovereign AI that they hope is Value Aligned with them. My mental model of the likely actors in such a scenario (e.g. the US Federal Government & Armed Forces) is that they would be extremely reluctant to irreversibly hand over power to an AI, even if they felt confident that the AI was aligned to their values. I also don’t model them as being interested in accepting a recommendation for a Value Alignment Target which seemed to them to compromise any of their values in favor of other people’s values. I model the Armed Forces (not just the US, but as a general pattern everywhere) as tending to be quite confident in the correctness of their own values and in the validity of coercing others to obey those values (the selection effect is strong here).
I agree: I don’t trust the idea of attempting to rapidly accelerate Value-Alignment Target Research with AI assistants. It seems like the sort of research I’d want done carefully, with lots of input from a diverse set of people from many cultures around the world. That doesn’t seem like the sort of problem amenable to many-fold acceleration by AI R&D.
In conclusion:
I do support the idea of you working on Value-Alignment Target Analysis research, and of attempting to recruit people to work on it with you who aren’t currently working on AI safety. I am not yet convinced that it makes sense to shift current AI safety/alignment researchers (including myself) from what they are currently doing to VATA.
However, I’m open to discussion; if you change my mind on some of the points above, my overall view would likely shift.
Also, I think you are mistaken in saying that nobody is working on this. I think it’s more the case that people working on this don’t want to say that this is what they are working on. Admitting that you are aiming for a Sovereign AI is obviously not a very politically popular statement. People might rightly worry that you’d steer such an AI toward the Value-Alignment Target you personally believe in (which may in turn be biased toward your personal values). If someone didn’t like the proposals you were putting forth because they felt that their own values were underrepresented, they could be tempted to enter into conflict with you, or to try to beat your Sovereign AI team to the punch by rushing their own Sovereign AI team (or sabotaging yours). So there are probably people out there thinking about this but keeping silent about it.
I do think that there is more general work being done on related ideas, such as measuring the values of large heterogeneous groups of people, or trying to understand the nature of values and ethics. I don’t think you should write this work off too quickly, since a lot of it pertains quite directly to VATA.
Some recent examples:
(Note that I’ve read most of these; some of them express contradictory views, and I don’t feel that any of them perfectly represents my own views. I just want to share some of what I’ve been reading on the topic.)
https://www.lesswrong.com/posts/As7bjEAbNpidKx6LR/valence-series-1-introduction
https://www.lesswrong.com/posts/YgaPhcrkqnLrTzQPG/we-don-t-know-our-own-values-but-reward-bridges-the-is-ought
https://arxiv.org/abs/2404.10636
https://arxiv.org/html/2405.17345v1
https://medium.com/nerd-for-tech/openais-groundbreaking-research-into-moral-alignment-for-llms-7c4e0e5ffe97
https://futureoflife.org/ai/align-artificial-intelligence-with-human-values/
https://arxiv.org/abs/2311.17017
https://arxiv.org/abs/2309.00779
https://www.pnas.org/doi/10.1073/pnas.2213709120
https://link.springer.com/article/10.1007/s43681-022-00188-y
The proposed research project would indeed be focused on a certain type of alignment target: proposals along the lines of PCEV, but not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) is a workable notation, and I will adopt it for the rest of this comment.
The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:
There is a big difference between proposing an alignment target on the one hand, and pointing out problems with alignment targets on the other. For example, it is entirely possible to reduce risks from a dangerous alignment target without having any idea how one might find a good alignment target. One can in fact reduce risks without having any idea what it even means for an alignment target to be a good one.
The feature of PCEV mentioned in the post is an example of this. The threat posed by PCEV has presumably been mostly removed, and this did not require anything along the lines of an answer. The analysis of Condorcet AI (CAI) is similar: it simply describes a feature shared by all CAI proposals (the feature that a barely caring solid majority can do whatever they want with everyone else). Pointing this out presumably reduces the probability that a CAI will be launched by designers who never considered this feature. All claims made in the post about a VATA research project being tractable refer to this type of risk mitigation being tractable. There is definitely no claim that a VATA research project can (i): find a good alignment target, (ii): somehow verify that this alignment target does not have any hidden flaws, and (iii): convince whoever is in charge to launch this target.
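To make the "barely caring solid majority" feature concrete, here is a minimal toy sketch. It is my own illustration, not the actual PCEV or CAI formalism, and all option names and utility numbers in it are invented. The point is structural: any aggregation step that relies on pairwise majority comparisons is blind to how strongly each voter cares, so a 51% coalition that gains almost nothing can outvote a 49% minority that loses enormously.

```python
# Toy illustration (not any actual CAI design): pairwise majority voting
# ignores preference intensity. All utility numbers below are invented.

def pairwise_majority_winner(utilities, option_a, option_b):
    """Return whichever option more voters prefer; how much they care is ignored."""
    votes_a = sum(1 for u in utilities if u[option_a] > u[option_b])
    votes_b = sum(1 for u in utilities if u[option_b] > u[option_a])
    return option_a if votes_a > votes_b else option_b

def total_welfare(utilities, option):
    """Sum of utilities: one (also imperfect) way of letting intensity matter."""
    return sum(u[option] for u in utilities)

# 51 voters gain a tiny amount from exploiting the minority;
# 49 voters lose enormously if that happens.
majority = [{"status_quo": 0.0, "exploit_minority": 0.01} for _ in range(51)]
minority = [{"status_quo": 0.0, "exploit_minority": -100.0} for _ in range(49)]
population = majority + minority

winner = pairwise_majority_winner(population, "exploit_minority", "status_quo")
print("Pairwise-majority winner:", winner)                                   # exploit_minority
print("Total welfare of winner: ", total_welfare(population, winner))        # 0.51 - 4900 = -4899.49
print("Total welfare of loser:  ", total_welfare(population, "status_quo"))  # 0.0
```

The point is not that summing utilities is the right fix (it has well-known problems of its own); it is just that intensity-blindness is baked into the pairwise-majority step itself, which is why every proposal built on it shares the feature described above.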
One can also go a bit beyond analysis of individual proposals, even without any idea of how to find an answer. One can mitigate risk by describing necessary features (for example along the lines of this necessary Membrane formalism feature). This reduces risks from all proposals that clearly do not have such a necessary feature.
(And just to be extra clear: the post is not arguing that launching a Sovereign AI is a good idea. The post assumes an audience that agrees a Sovereign AI might possibly be launched, and argues that if this does happen, there is a risk that the Sovereign AI project will be aiming at a bad value alignment target. The post then further argues that this particular risk can be reduced by doing VATA.)
Regarding people being skeptical of Value Alignment Target proposals:
If someone ends up with the capability to launch a Sovereign AI, then I certainly hope that they will be skeptical of proposed Value Alignment Targets. Such skepticism can avert catastrophe even if the proposed alignment target has a flaw that no one has noticed.
The issue is that a situation might arise where (i): someone has the ability to launch a Sovereign AI, (ii): there exists a Sovereign AI proposal that no one can find any flaws with, and (iii): there is a time crunch.
Regarding the possibility that there exist people trying to find an answer without telling anyone:
I’m not sure how to estimate the probability of this. From a risk mitigation standpoint, it is certainly not the optimal way of doing things (if a proposed alignment target has a flaw, that flaw will be a lot easier to notice if the proposal is not kept secret). I really don’t think that this is a reasonable way of doing things. But I think that you have a point. If Bob is about to launch an AI Sovereign with some critical flaw that would lead to some horrific outcome, then secretly working Steve might be able to notice this flaw. And if Bob is just about to launch his AI, and speaking up is the only way for Steve to prevent Bob from causing a catastrophe, then Steve will presumably speak up. In other words: the existence of people like secretly working Steve would indeed offer some level of protection. It would mean that the lack of people with relevant intuitions is not as bad as it appears (and when allocating resources, this possibility would indeed point toward fewer resources for VATA). But I think that what is really needed is at least some people doing VATA with a clear risk mitigation focus, and discussing their findings with each other. This does not appear to exist.
Regarding other risks, and the issue that findings might be ignored:
A VATA research project would not help with misalignment. In other words: even if the field of VATA were somehow completely solved tomorrow, AI could still lead to extinction. So the proposed research project is definitely not dealing with all risks. The point of the post is that the field of VATA is basically empty. I don’t know of anyone who is doing VATA full time with a clear risk mitigation focus. And I don’t know whether you personally should switch to focusing on VATA; it would not surprise me at all if some other project is a better use of your time. It just seems like there should exist some form of VATA research project with a clear risk mitigation focus.
It is also possible that a VATA finding will be completely ignored (by leading labs, by governments, or by someone else). It is possible that a Sovereign AI will be launched, leading to catastrophe, even though it has a known flaw (because the people launching it simply refuse to listen). But finding a flaw at least means that it is possible to avert catastrophe.
PS:
Thanks for the links! I will look into them. (I think that there are many fields of research that are relevant to VATA. It’s just that one has to be careful: a concept can behave very differently when it is transferred to the AI context.)
Adding more resources:
[Clearer Thinking with Spencer Greenberg] Aligning society with our deepest values and sources of meaning (with Joe Edelman) https://podcastaddict.com/clearer-thinking-with-spencer-greenberg/episode/175816099
https://podcastaddict.com/joe-carlsmith-audio/episode/157573278