I’m leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values, preventing overoptimized weird/controversial situations, even at the cost of astronomical waste. Absence of x-risks, including AI risks, is generally good. Within this environment, the civilization might be able to eventually work out more about values, expanding the scope of their definition and thus allowing stronger optimization. Here corrigibility is in part about continually picking up the values and their implied scope from the predictions of how they would’ve been worked out some time in the future.
I’m leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values
Please say more about this? What are some examples of “relatively well-understood values”, and what kind of AI do you have in mind that can potentially safely optimize “towards good trajectories within scope” of these values?
My point is that the alignment (values) part of AI alignment is least urgent/relevant to the current AI risk crisis. It’s all about corrigibility and anti-goodharting. Corrigibility is hope for eventual alignment, and anti-goodharting makes inadequacy of current alignment and imperfect robustness of corrigibility less of a problem. I gave the relevant example of relatively well-understood values, preference for lower x-risks. Other values are mostly relevant in how their understanding determines the boundary of anti-goodharting (what counts as not too weird for them to apply), not in what they say is better. If anti-goodharting holds (too-weird and too-high-impact situations are not pursued in planning, and are possibly actively discouraged), and some sort of long reflection is still going on, then current alignment (details of what the values-in-AI prefer, as opposed to what they can make sense of) doesn’t matter in the long run.
I include maintaining a well-designed long reflection somewhere within corrigibility, for without it there is no hope for eventual alignment; so a decision-theoretic agent that has the long reflection within its preferences is corrigible in this sense. Its corrigibility depends on following a good decision theory, so that there actually exists a way for the long reflection to determine its preferences in a way that causes the agent to act as the long reflection wishes. But being an optimizer, it’s horribly non-anti-goodharting, so it can’t be stopped and probably eats everything else.
An AI with anti-goodharting turned to the max is the same as an AI with its stop button pressed. An AI with minimal anti-goodharting is an optimizer, AI risk incarnate. Stronger anti-goodharting is a maintenance mode, an opportunity for fundamental change; weaker anti-goodharting makes use of more developed values to actually do things. So a way to control the level of anti-goodharting in an AI is a corrigibility technique. The two concepts work well with each other.
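To make the “anti-goodharting dial” picture above a bit more concrete, here is a minimal toy sketch in Python (my own illustration, not anything proposed in this thread; the Plan fields, the weirdness score, and choose_plan are all hypothetical placeholders). The budget parameter plays the role of the dial: at zero only the do-nothing plan stays in scope (the stop-button limit), and as the budget grows the agent approaches an unconstrained optimizer.

```python
# Toy sketch of "anti-goodharting level as a corrigibility dial".
# Illustrative only: the weirdness scores here stand in for whatever
# measure of strayed-from-well-understood-situations one actually had.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    description: str
    estimated_value: float  # how good the plan looks under current values
    weirdness: float        # how far it strays from well-understood situations


def choose_plan(candidates: List[Plan], weirdness_budget: float) -> Optional[Plan]:
    """Pick the highest-value plan that stays within the anti-goodharting budget.

    With weirdness_budget = 0 only trivially safe plans survive (the
    stop-button limit); with a very large budget this degenerates into an
    unconstrained optimizer.
    """
    in_scope = [p for p in candidates if p.weirdness <= weirdness_budget]
    if not in_scope:
        return None  # nothing is conservative enough: do nothing
    return max(in_scope, key=lambda p: p.estimated_value)


if __name__ == "__main__":
    candidates = [
        Plan("do nothing", estimated_value=0.0, weirdness=0.0),
        Plan("modest, well-understood improvement", estimated_value=1.0, weirdness=0.2),
        Plan("galaxy-scale overoptimized scheme", estimated_value=100.0, weirdness=0.95),
    ]
    for budget in (0.0, 0.5, 1.0):
        chosen = choose_plan(candidates, budget)
        print(f"budget={budget}: {chosen.description if chosen else 'shut down'}")
```

Of course, the hard part this sketch waves away is where a trustworthy weirdness/impact measure comes from in the first place.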
This seems interesting and novel to me, but (of course) I’m still skeptical.
I gave the relevant example of relatively well-understood values, preference for lower x-risks.
Preference for lower x-risk doesn’t seem “well-understood” to me, if we include in “x-risk” things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment. What do you think about the problems on that list? (Do you agree that they are serious problems, and if so how do you envision them being solved or prevented in your scenario?)
The fact that AI alignment research is 99% about control, and 1% (maybe less?) about metaethics (in the sense of how we even aggregate the utility functions of all of humanity) hints at what is really going on, and that’s enough said.
Have you heard about CEV and Fun Theory? In an earlier, more optimistic time, this was indeed a major focus. What changed is that we became more pessimistic and decided to focus on first things first: if you can’t control the AI at all, it doesn’t matter what metaethics research you’ve done. Also, the longtermist EA community still thinks a lot about metaethics relative to literally every other community I know of, on par with and perhaps slightly more than my philosophy grad student friends. (That’s my take at any rate; I haven’t been around that long.)
CEV was written in 2004, and Fun Theory 13 years ago. I couldn’t find any recent MIRI paper that was about metaethics (granted, I haven’t gone through all of them). The metaethics question is just as important as the control question for any utilitarian (what good is it to control an AI only for it to be aligned with some really bad values? An AI controlled by a sadistic sociopath is infinitely worse than a paper-clip maximizer). Yet all the research is focused on control, and it’s very hard not to be cynical about it. If some people believe they are creating a god, it’s selfishly prudent to make sure you’re the one holding the reins to this god. I don’t get the blind trust in the benevolence of Peter Thiel (who finances this) or of other people who will suddenly have godly powers to care for all humanity; it seems naive given all we know about how power corrupts and how competitive and selfish people are. Most people are not utilitarians, so as a quasi-utilitarian I’m pretty terrified of what kind of world will be created with an AI controlled by the typical non-utilitarian person.
My claim was not that MIRI is doing lots of work on metaethics. As far as I know they are focused on the control/alignment problem. This is not because they think it’s the only problem that needs solving; it’s just the most dire, the biggest bottleneck, in their opinion.
You may be interested to know that I share your concerns about what happens after (if) we succeed at solving alignment. So do many other people in the community, I assure you. (Though I agree that, on the margin, more quiet awareness-raising about this would plausibly be good.)
http://www.metaethical.ai is the state of the art as far as I’m concerned…