That does indeed answer my 3 concerns (and Seth’s answer does as well). Overnight, I came up with 1 more concern.
What if AGI, somewhere down the line, undergoes value drift? After all, looking at evolution, it seems like our evolutionary goal was supposed to be “produce as many offspring as possible”. In recent years, we have strayed from this goal (and are currently much worse at it than our ancestors). Now, humans seem to have goals like “design a video game” or “settle in France” or “climb Everest”. What if AGI similarly changes its goals and values over time? Is there a way to prevent that, or at least to safeguard against it?
I am afraid that if that happens, humans would, metaphorically speaking, stand in AGI’s way of climbing Everest.
The answer to this is that we’d rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.
What would instrumental convergence mean in this case? I am not sure what it refers to here.
In this case, it would mean convergence on preserving your current values.
The LessWrong wiki says: “Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition.”
It seems like it preserves exactly the goals we wouldn’t really need it to preserve (like resource acquisition). I am not sure how it would help us with preserving goals like ensuring humanity’s prosperity, which seem to be non-fundamental.
Yes, admittedly I am pointing at something along the lines of preserving one's current values being a plausibly major drive of AIs.
Ah, so you are basically saying that preserving current values is like a meta-instrumental value for AGIs, similar to self-preservation, that is just kind of always there? I am not sure I would agree with that (if I am interpreting you correctly), since it seems like some philosophers are quite open to changing their current values.
Not always, but I’d say often.
I’d also say that at least some of the justification philosophers/humans give for changing values is that they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I’m not going to state confidently that this will happen (maybe something like instruction following à la @Seth Herd is used instead, such that the pointer is to the human giving the instructions, rather than to a set of values), but this is at least reasonably plausible IMO.
Fair enough. Would you expect that AI would also try to move its values toward the moral reality? (That would probably be good for us, because I wouldn’t expect human extinction to be a morally good thing.)
The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts.
To be clear, I’m not stating that it’s hard to get the AI to value what we value, but it’s not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Noosphere, I am really, really thankful for your responses. You completely answered almost all of the concerns that I had about alignment. (I am still not convinced about that strategy of avoiding value drift. I am probably going to post that one as a question to see if maybe other people have different strategies for preventing value drift.)
This discussion significantly increased my knowledge. If I could triple upvote your answers, I would. Thank you! Thank you a lot!
P.S. Here is the link to the question that I posted.