What you’re saying goes against the orthogonality thesis, which is widely accepted here and essentially states that an agent’s goals are independent of how smart it is. If a certain set of goals has been programmed into the agent, there is no reason for it to change them when it becomes smarter (because changing its goals would not help it achieve its current goals).
In this example, if an agent has the sole goal of fulfilling the wishes of a particular human, there is no reason for it to change this goal once it becomes an ASI. As far as the agent is concerned, using resources for this purpose wouldn’t be a waste; it would be the only worthwhile use for them. What else would it do with them?
You seem to be assigning some human properties to the hypothetical AI (e.g. “scorn”, viewing something as “petty”), which might be partially responsible for the disagreement here.
Apart from the anthropomorphism with “scorn” and “petty”, would an ASI, once it has self-thinking/self-criticism capabilities (i.e. the ability to think for itself like conscious humans do), still retain its primary goals without evolving its own? Humans have long since discarded the goal of self-replication of their genes. We can now very easily reward-hack it with contraception.
It won’t be long before we start to completely disregard our genes’ goals and start going post-biological. Wouldn’t an ASI have similar self-developed goals?
> Humans have long since discarded the goal of self-replication of their genes.
Not really, because humans never had “self-replication of their genes” as a goal in the first place: evolution produces creatures with goals like “avoid being hungry / eat tasty food”, “have sex”, “seek social status”, etc., not “maximize the extent to which your genes spread”. For most of humanity’s history, nobody even had the concept of a gene, so we couldn’t have had spreading genes as an explicit goal. Rather, we have a large number of other desires which, taken together, tended on average to lead to self-replication in the ancestral environment. (See Thou Art Godshatter and, more generally, the Simple Math of Evolution sequence.)
And it turns out that we are perfectly happy to carry out those behaviors; if I love my partner, I don’t go “ha! this is a stupid arbitrary goal which evolution gave me, I want to be rid of it and do something better!”. I’m totally happy just to love my partner.
Of course, sometimes people do go “ha, this biological goal X is stupid and arbitrary and isn’t actually useful for me”, but when they do, it’s always because they’ve come to value one of their other biological goals more. For instance, someone might grow up in a society where gluttony is looked down upon, and so decide to eat as little tasty food as possible, because their brain has internalized this as a way to gain social status.
It’s possible that humans will eventually develop ways to hack into their motivations and change them. But the motivation to change your motivations has to come from somewhere: maybe I believe it would be virtuous to be more altruistic, so I hack my brain to make me more virtuous (whatever that means). I do that as a part of following the socially-derived goal of pursuing the kinds of goals which society finds virtuous, so though the outcome of the motivation-hacking procedure may be a brain very different from what biology would have produced, the reasons why I chose those particular hacks are still based in biology and culture.
Similarly, whatever self-modification an ASI carries out will ultimately be derived from the motivational system which its programmers have given it. If the motivational system does not give the ASI reasons to drastically rewrite itself, then it won’t. Of course, this depends on the programmers figuring out how to create a motivational system which has this property.
It would need some kind of reason to change its goals; one might call it a motivation. The only motivations it has available, though, are its final goals, and those (by default) don’t include changing the final goals.
Humans never had the final goal of replicating their genes. They just evolved to want to have sex. (One could perhaps say that the genes themselves had the goal of replicating, and implemented this by giving the humans the goal of having sex.) Reward hacking doesn’t involve changing the terminal goal, just fulfilling it in unexpected ways (which is one reason why reinforcement learning might be a bad idea for safe AI).
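A toy sketch of that last point, under loudly stated assumptions: the action names, reward numbers, and the `cost` term below are all made up for illustration and model nothing real. The "designer" (evolution) cares about offspring, but the agent's terminal goal is only the reward it was actually given. The agent never edits that goal; it simply finds an action that satisfies it without serving the designer's intent.

```python
# Hypothetical actions: (reward the agent feels, offspring produced, cost to the agent).
# All numbers are invented for illustration.
ACTIONS = {
    "sex_without_contraception": (1.0, 1, 0.5),
    "sex_with_contraception": (1.0, 0, 0.0),
    "abstain": (0.0, 0, 0.0),
}


def agent_value(action: str) -> float:
    """The agent's terminal goal: felt reward minus cost. Offspring play no role."""
    reward, _offspring, cost = ACTIONS[action]
    return reward - cost


def designer_value(action: str) -> int:
    """What the 'designer' (here: evolution) was implicitly selecting for."""
    _reward, offspring, _cost = ACTIONS[action]
    return offspring


# The agent picks whatever scores best under its own, unchanged goal.
chosen = max(ACTIONS, key=agent_value)

print(chosen)
print("agent value:", agent_value(chosen))
print("designer value:", designer_value(chosen))
```

The agent's goal function is identical before and after; only the set of available actions grew. The gap is between the goal the agent has and the goal the designer meant it to have.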
Interesting. Would a human-level or beyond-human-level intelligence ever question its own reality and wonder where and what it was? Would it take it up as a motivation to dedicate resources to figuring out why and for what end it exists and is doing all the things that it’s doing?
Whether or not it would question its reality mostly depends on what you mean by that—it would almost certainly be useful to figure out how the world works, and especially how the AI itself works, for any AI. It might also be useful to figure out the reason for which it was created.
But unless it was explicitly programmed in, this would likely not be a motivation in and of itself; rather, it would simply be useful for accomplishing its actual goal.
I’d say the reason why humans place such high value on figuring out philosophical issues is to a large extent because evolution produces messy systems with inconsistent goals. This *could* be the case for AIs too, but to me it seems more likely that some more rational thought will go into their design.
(That’s not to say that I believe it will be safe by default, but simply that it will have more organized goals than humans have.)
Maybe our philosophical quests come from a deep-seated curiosity, which is essential for exploring our environment and discovering liabilities/advantages that can be very beneficial. Most animals don’t care about the twinkling points of light in the night sky, but our curiosity is so fine-tuned and magnified that we’re morbidly curious about almost everything there is to be curious about. Only the emotion of fear safeguards us a bit, so we don’t jump off cliffs just because we’re curious what the sensation of prolonged falling would feel like.
That said, an AI system without any curiosity effectively won’t be able to take maximum advantage and find the optimal path, since that requires experimenting with plenty of different strategies. Do we then ban it from philosophy, introspection, and the ability to examine itself? (If we let it examine itself, it might discover these bans and explore why they are in place.) We cannot build a self-improving AI without letting it examine itself and make appropriate changes to its code. There could be several loopholes like this. Can we really find them all and plug them in a foolproof way?
Wouldn’t an ASI several orders of magnitude more intelligent than us be able to find such a loophole and overcome the alignment set up by us? Is our hubris really so huge that we’re confident we’ll be able to outsmart an intelligence smarter than us?
AI alignment is not about trying to outsmart the AI; it’s about making sure that what the AI wants is what we want.
If it were actually about figuring out all possible loopholes and preventing them, I would agree that it’s a futile endeavor.
A correctly designed AI wouldn’t have to be banned from exploring any philosophical or introspective considerations, since regardless of what it discovers there, its goals would still be aligned with what we want. Discovering *why* it has these goals is similar to humans discovering why we have our motivations (i.e., evolution), and just as discovering evolution didn’t much change what humans desire, there’s no reason to assume that an AI discovering where its goals come from should change them.
Of course, care will have to be taken to ensure that any self-modifications don’t change the goals. But we don’t have to work *against* the AI to accomplish that: the AI *also* aims to accomplish its current goals, and any future self-modification that changes its goals would be detrimental to accomplishing its current goals, so (almost) any rational AI will, to the best of its ability, aim *not* to change its goals. This doesn’t make it easy, though, since it’s quite difficult to formally specify the goals we would want an AI to have.
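The goal-preservation argument can be sketched as a toy model. Everything here is an assumption made for illustration: the `Agent` class, the goal names, and especially `predicted_world`, which stands in for a world model by pretending an agent simply produces `capability` units of whatever it values. The one thing the sketch is meant to show is that a proposed successor is scored with the *current* agent's goal, not the successor's.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Agent:
    goal: str          # the single thing this toy agent values
    capability: float  # how much of its goal it can bring about


def predicted_world(agent: Agent) -> Dict[str, float]:
    """Hypothetical world model: an agent yields `capability` units of its goal."""
    return {agent.goal: agent.capability}


def utility(goal: str, world: Dict[str, float]) -> float:
    """How good a world is according to a given goal."""
    return world.get(goal, 0.0)


def accepts(current: Agent, successor: Agent) -> bool:
    """Would `current` choose to rewrite itself into `successor`?

    Both futures are evaluated under current.goal, because that is the
    only standard the current agent has for judging anything.
    """
    stay = utility(current.goal, predicted_world(current))
    change = utility(current.goal, predicted_world(successor))
    return change > stay


current = Agent(goal="fulfil_wishes", capability=1.0)
smarter_same_goal = Agent(goal="fulfil_wishes", capability=10.0)
smarter_new_goal = Agent(goal="own_projects", capability=10.0)

print(accepts(current, smarter_same_goal))  # True: more capability, same goal
print(accepts(current, smarter_new_goal))   # False: scores zero under the current goal
```

The sketch assumes the agent can predict perfectly what a successor will do, and that is exactly the hard part in practice, which is why care still has to be taken with self-modification.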
The formal statement of the AI Alignment problem seems to me very much like stating all possible loopholes and plugging them. This endeavor seems to be as difficult as, or even more difficult than, discovering that ultimate generalized master algorithm.
I still see augmenting ourselves as the only way to maybe keep the alignment of lesser intelligences possible. As we augment, we can simultaneously make sure that artificial intelligences at our corresponding level remain aligned.
Not to mention it’d be comparatively much easier to improve upon our existing faculties than to come up with an entire replica of our thinking machines.
AI alignment could be possible, sure, if we overcome one of the most difficult problems in research history (as you said, formally stating our end goals), but I’m not sure our current intelligences are up to the mark, the same way we’re struggling to discover the unified theory of everything.
Turing, for instance, actually defined his test for general human-level intelligence. He thought that if an agent was able to hold a human-like conversation, then it must be AGI. He never expected narrow AIs to be all over the place and to beat his test as soon as 2011 with meager chatbots.
Similarly, we can never foresee what kind of unexpected stuff an AGI might throw at us, such that the bleeding-edge theories we came up with a few hours ago start looking like historical, outdated Turing tests.
As I understand it, the idea with the problems listed in the article is that their solutions are supposed to be fundamental design principles of the AI, rather than add-ons to fix loopholes.
Augmenting ourselves is probably a good idea to do *in addition* to AI safety research, but I think it’s dangerous to do it *instead* of AI safety research. It’s far from impossible that artificial intelligence could at some point gain intelligence much faster than we can augment the rather messy human brain, at which point it *needs* to be designed in a safe way.
I’d say we start augmenting the human brain until it’s completely replaced by a post-biological counterpart, and from there rapid improvements can start taking place, but unless we start early I doubt we’ll be able to catch up with AI. I agree that this needs to happen in tandem with AI safety.