Planned summary:
This post considers the Value Definition Problem: what should our AI system <@try to do@>(@Clarifying “AI Alignment”@), to have the best chance of a positive outcome? It argues that an answer to the problem should be judged based on how much easier it makes alignment, how competent the AI system has to be to optimize it, and how good the outcome would be if it were optimized. Solutions also differ on how “direct” they are. On one end, explicitly writing down a utility function would be very direct, while on the other, something like Coherent Extrapolated Volition would be very indirect: it delegates the task of figuring out what is good to the AI system itself.
Planned opinion:
I fall more on the side of preferring indirect approaches, though by that I mean that we should delegate to future humans, as opposed to defining some particular value-finding mechanism into an AI system that eventually produces a definition of values.
I appreciate the summary, though the way you state the VDP isn’t quite the way I meant it.
what should our AI system <@try to do@>(@Clarifying “AI Alignment”@), to have the best chance of a positive outcome?
To me, this reads like, ‘we have a particular AI, what should we try to get it to do’, whereas I meant it as ‘what Value Definition should we be building our AI to pursue’. So, that’s why I stated it as ‘what should we aim to get our AI to want/target/decide/do’ or, to be consistent with your way of writing it, ‘what should we try to get our AI system to do to have the best chance of a positive outcome’, not ‘what should our AI system try to do to have the best chance of a positive outcome’. Aside from that minor terminological difference, that’s a good summary of what I was trying to say.
I fall more on the side of preferring indirect approaches, though by that I mean that we should delegate to future humans, as opposed to defining some particular value-finding mechanism into an AI system that eventually produces a definition of values.
I think your opinion is probably the majority opinion. My major point with the ‘scale of directness’ was to emphasize that our ‘particular value-finding mechanisms’ can have more or fewer degrees of freedom: from a certain perspective, ‘delegate everything to a simulation of future humans’ is also a ‘particular mechanism’, just with a lot more degrees of freedom. So even if you strongly favour indirect approaches you will still have to make some decisions about the nature of the delegation.
The original reason that I wrote this post was to get people to explicitly notice that we will probably have to do some philosophical labour ourselves at some point, and then I discovered Stuart Armstrong had already made a similar argument. I’m currently working on another post (also based on the same work at the EA Hotel) with some more specific arguments about why we should construct a particular value-finding mechanism that doesn’t fix us to any particular normative ethical theory, but does fix us to an understanding of what values are: something I call a Coherent Extrapolated Framework (CEF). But again, Stuart Armstrong anticipated a lot (but not all!) of what I was going to say.
To me, this reads like, ‘we have a particular AI, what should we try to get it to do’
Hmm, I definitely didn’t intend it that way; I’m basically always talking about how to build AI systems, and I’d hope my readers see it that way too. But in any case, adding three words isn’t a big deal, so I’ll change that.
(Though I think it is “what should we get our AI system to try to do”, as opposed to “what should we try to get our AI system to do”, right? The former is intent alignment, the latter is not.)
even if you strongly favour indirect approaches you will still have to make some decisions about the nature of the delegation
In some abstract sense, certainly. But it could be “I’ll take no action; whatever future humanity decides on will be what happens”. This is in some sense a decision about the nature of the delegation, but not a huge one. (You could also imagine believing that delegating will be fine for a wide variety of delegation procedures, and so you aren’t too worried which one gets used.)
For example, perhaps we solve intent alignment in a value-neutral way (that is, the resulting AI system tries to figure out the values of its operator and then satisfy them, and can do so for most operators). Then every human gets an intent-aligned AGI, which leads to a post-scarcity world; all of the future humans figure out what they as a society care about (the philosophical labor), and then that is optimized.
Of course, the philosophical labor did eventually happen, but the point is that it happened well after AGI, and pre-AGI nothing major needed to be done to delegate to the future humans.
The scenario where every human gets an intent-aligned AGI, and each AGI learns its operator’s particular values, would be a case where each individual AGI is following something like ‘Distilled Human Preferences’, or possibly just ‘Ambitious Learned Value Function’, as its Value Definition, so a fairly Direct scenario. However, the overall outcome would be more towards the indirect end, because a multipolar world with lots of powerful humans using AGIs and trying to compromise would (you anticipate) end up converging on our CEV, or Moral Truth, or something similar. I didn’t consider direct vs indirect in the context of multipolar scenarios like this (nor did Bostrom, I think), but it seems sufficient to just say that the individual AGIs use a fairly direct Value Definition while the outcome is indirect.