One of the problems is S-risk. To change “care about maximizing fun” into “care about maximizing suffering”, you just need to put a minus sign in the wrong place in the mathematical expression that describes your goal.
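As a minimal illustrative sketch (a hypothetical toy objective, not anyone’s actual proposal), the failure really is just a sign error: the optimizer stays exactly the same, and one misplaced minus inverts what it pursues.

```python
def wellbeing(state: dict) -> float:
    """Toy stand-in for 'fun minus suffering' in a world state."""
    return state["fun"] - state["suffering"]

def intended_objective(state: dict) -> float:
    # The goal as meant: maximize fun.
    return wellbeing(state)

def corrupted_objective(state: dict) -> float:
    # One misplaced minus sign: the same optimizer now maximizes suffering.
    return -wellbeing(state)

def best_action(actions, objective):
    # The optimizer is unchanged; only the sign of the goal differs.
    return max(actions, key=lambda a: objective(a["outcome"]))

actions = [
    {"name": "help", "outcome": {"fun": 10, "suffering": 0}},
    {"name": "harm", "outcome": {"fun": 0, "suffering": 10}},
]

print(best_action(actions, intended_objective)["name"])   # -> help
print(best_action(actions, corrupted_objective)["name"])  # -> harm
```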
I certainly agree with that.
In some sense, almost any successful alignment solution that minimizes X-risk seems to carry a good deal of S-risk with it: if one wants AI to actually care about what sentient beings feel, it almost follows that one needs to make sure the AI can “truly look inside the subjective realm” of another sentient entity (to “feel what it is like to be that entity”), and that capability (if it is achievable at all) is very abusable in terms of S-risks.
But this is something no “pivotal act” is likely to change (when people talk about “pivotal acts”, it’s typically about minimizing (a subset of) X-risks).
And moreover, S-risk is a very difficult problem that we really do need powerful thinkers to work on (and not just today’s humans).
Corrigibility features usually imply something like “the AI acts only inside the box and limits its causal impact outside the box in some nice way that lets us take the bottle with the nanofactory out of the box to perform the pivotal act, but prevents the AI from programming the nanofactory to do something bad”. That is, we dodge the problem of making AGI care about humans by building an AGI that wants to do the task (a simple task, with no mention of humans) in a very specific way that rules out killing everyone.
Right.
But this does not help us deal with the consequences of that act (if it is a simple act, like the proverbial “GPU destruction”), and if we discover that overall risks have increased as a result, then what could we do?
And if that AI remains a boxed resource (capable of continuing to perform further destructive acts like “GPU destruction” at the direction of a particular group of humans), I envision a full-scale military confrontation over access to and control of that resource as being almost inevitable.
And, in reality, AI can run on CPUs (it just takes somewhat longer), so how much destruction of our way of life would we be willing to risk? With no computers at all, apart from some limited exceptions, the death toll of that change alone would probably already be in the billions...
Actually, another example of a pivotal act is “invent a method of mind uploading, upload some alignment researchers, and run them at 1000x speed until they solve the full alignment problem”. I’m sure that if you think hard enough, you can find some other, even less dangerous pivotal act, but you probably shouldn’t talk about it out loud.
Right, but how do you restrict them from “figuring out how to know themselves and figuring out how to self-improve themselves to become gods”? And I remember talking to Eliezer in … 2011 at his AGI-2011 poster and telling him, “but we can’t control a teenager, so why wouldn’t an AI rebel against your ‘provably safe’ technique, like a human teenager would”, and he answered, “that’s why it should not be human-like; a human-like one can’t be provably safe”.
Yes, I am always unsure about what we can or can’t talk about out loud (nothing effective seems to be safe to talk about, since “effective” always seems to imply “powerful”). This is, of course, one of the key conundrums: how do we organize real discussions about these things?...
Yep, corrigibility is unsolved! So we should try to solve it.
I wrote the following in 2012:
“The idea of trying to control or manipulate an entity which is much smarter than a human does not seem ethical, feasible, or wise. What we might try to aim for is a respectful interaction.”
I still think that this kind of more symmetric formulation is the best we can hope for, unless the AI we are dealing with is not “an entity with sentience and rights” but only a “smart instrument”. (Even the LLM-produced simulations, in the sense of Janus’ Simulator theory, already seem to me to be much more than “merely smart instruments” in this sense; so if “smart superintelligent instruments” are possible at all, we are not moving in the right direction to obtain them. A different architecture and different training methods, or perhaps non-training synthesis methods, would be necessary for that, and that would be difficult to talk about out loud, because it is very powerful too.)