“This is an ASI that’s smart enough to know what humans really want, but not smart (or corrigible) enough to accept a subtle goal function like “do what we mean”. This just doesn’t seem plausible to me.”
Why would a program that’s been built with goal function A suddenly switch over to using goal function B just because it’s become smart enough to understand goal function B? A chess playing AI probably has a pretty good idea of what its opponent’s goals are, but it will still try to win itself rather than let its opponent win.
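To make that concrete, here’s a toy sketch (a made-up number game, not an actual chess engine): a minimax player explicitly models its opponent’s goal in the minimizing step, yet its own move choice is always driven by its own evaluation.

```python
# Toy illustration: a minimax player fully models the opponent's goal
# (the `min` step assumes the opponent minimizes the same score),
# yet its own move choice is always the argmax of ITS OWN evaluation.
# The game: players alternately add 1 or 2; whoever reaches 10 wins.

def legal_moves(total):
    return [1, 2] if total < 10 else []

def evaluate(total, maximizer_moved_last):
    # +1 if the maximizing player reached 10, -1 if the opponent did.
    if total >= 10:
        return 1 if maximizer_moved_last else -1
    return 0

def minimax(total, maximizing):
    moves = legal_moves(total)
    if not moves:
        return evaluate(total, not maximizing)
    if maximizing:
        return max(minimax(total + m, False) for m in moves)
    # Opponent model: the agent "knows" the opponent's goal is the mirror
    # image of its own -- but knowing goal B never replaces goal A.
    return min(minimax(total + m, True) for m in moves)

def best_move(total):
    return max(legal_moves(total), key=lambda m: minimax(total + m, False))

print(best_move(8))  # prints 2: it takes the win itself rather than handing it over
```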
I think the dispute about continuity would be better framed as a question of deep tech trees vs broad tech trees. The broad tech tree is one where there are many branches, but only a few are viable, so you’re spending most of your time searching through different branches, trying to find the right one. A deep tech tree is the opposite: fewer branches, but they’re long, so you’re spending a lot of time pushing those branches forward. In the deep tech tree case, we expect to see gradual progress as branches are pushed forwards, and probably a big economic impact from earlier advances on a branch that will lead to AGI. Broad branching means a single team can suddenly develop AGI, just because they happened to pick the right branch to work on. Of course, trees can be both broad and deep, or both narrow and short, and can have all kinds of detailed shapes. Probably better not to pretend we know very much about the shape of the tree for things we haven’t discovered yet.
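Here’s a toy simulation of that intuition (all the numbers are invented for illustration): in the broad regime a team’s visible progress is mostly flat until it happens to be pushing the viable branch, while in the deep regime progress accrues steadily.

```python
import random

# Toy sketch of the broad-vs-deep intuition; all parameters are invented.
# "Broad": many branches, only one is viable, so progress stays flat
# until the team happens to work on the right branch.
# "Deep": one long branch, every unit of effort adds visible progress.

def broad_progress(steps, n_branches=50, seed=0):
    rng = random.Random(seed)
    viable = rng.randrange(n_branches)
    progress, history = 0, []
    for _ in range(steps):
        branch = rng.randrange(n_branches)   # which branch gets pushed this period
        if branch == viable:
            progress += 1                    # only the viable branch counts
        history.append(progress)
    return history

def deep_progress(steps):
    return list(range(1, steps + 1))         # steady, predictable accumulation

print(broad_progress(20))  # long flat stretches, then sudden jumps
print(deep_progress(20))   # smooth, continuous progress
```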
I guess the crux here is that I don’t think HLMI need be so narrowly focused on mathematically specified goals. I think there is a kind of intelligence that doesn’t need goals to operate, and also a kind of intelligence that can actually understand that there is a “goal behind every goal” (aka theory of mind). A non-goal directed AI might be driven by creativity or a desire to learn. It might be like GPT-3 where it’s just trying to predict some complex input, but not in a goal-y way. You can obviously frame these things as goals, like in DeepMind’s paper “Reward is Enough”, but I do think there is a sense that intelligence doesn’t absolutely need to be goal-y. Yes, this gets into the oracle vs task vs agent AI debate.
Alignment arguments seem to imagine an “asocial AI” which can’t understand humans but can perform marvelously at narrowly focused, mathematically specified goals. For example, “design a nanotech pathogen to kill every human on the planet” but not “convince every human on the planet to believe in my favored political ideology.” Such an AI would certainly be dangerous. But the existence proof that intelligence is possible, the human brain, also gives us strong reason to believe that an AI could be built which is neither asocial nor sociopathic. The big American tech companies are most interested in this kind of highly social AI, since they trade in soft power and need charismatic AI which people will ultimately like and not reject.
In my opinion the asocial AI is not “real intelligence” and that’s what average people and tech company CEOs will say about it. A “real AI” can have a conversation with a person, and if the person says “your goal is bad, here’s why, please do this instead,” the AI would switch goals. Perhaps that can be framed as a meta-goal such as “do what humans say (or mean)”, but I expect (with some probability) that by the time we get to “real AI” we’ll have some out-of-the-box thinking and have a much better idea of how to make corrigible AI.
In this framework, if we can make highly social AI before we get extremely powerful asocial AI, then we are safe. But if extremely powerful AI comes before we teach AI to be social, then there is a big danger of an alignment problem.
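As a purely illustrative sketch (not a claim about how a real system would be built), the “do what humans say (or mean)” meta-goal can be pictured like this: the object-level goal is just mutable state, and human correction is allowed to overwrite it.

```python
# Illustrative sketch of corrigibility as a meta-goal (invented example):
# the object-level goal is ordinary mutable state, and the standing rule
# "defer to human correction" sits above it and can replace it.

class CorrigibleAgent:
    def __init__(self, goal):
        self.goal = goal                      # object-level goal (goal A)

    def act(self):
        return f"working toward: {self.goal}"

    def receive_feedback(self, objection, new_goal):
        # Meta-goal: "do what humans say (or mean)". The agent does not
        # defend goal A; it records the objection and adopts the new goal.
        print(f"human objection noted: {objection}")
        self.goal = new_goal                  # switch to goal B

agent = CorrigibleAgent("maximize paperclips")
print(agent.act())
agent.receive_feedback("that goal is bad, here's why...", "do what we mean")
print(agent.act())
```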
Why would a program that’s been built with goal function A suddenly switch over to using goal function B just because it’s become smart enough to understand goal function B?
Why would a programme have a goal function that’s completely separate from everything else? Our current most advanced AIs don’t. And if it did, why would you want one implementation of human semantics in the goal function, and another one in the implementation function? Why duplicate the effort?
What are you taking to be the current most advanced AIs? If it’s something like GPT-3, then the goal function is just to maximize log(probability assigned to the actual next token). This is separate from the rest of the network, though information flows back and forth. (Forwards because the network chooses the probabilities, and backwards through back-propagation of gradients.) My point here is that GPT-N is not going to suddenly decide “hey, I’m going to use cos(product of all the network’s outputs) as my new goal function”.
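As a minimal sketch of that separation (a toy model, not GPT-3 itself): the loss is a fixed function bolted onto the network’s outputs; gradients flow backwards through it, but nothing in training rewrites the loss itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy language model (not GPT-3 itself) showing the separation being
# described: the network produces logits; the goal function is a fixed
# piece of code (cross-entropy, i.e. the negative of the log-probability
# assigned to the actual next token, so minimizing it maximizes that
# log-probability). Information flows forward (logits -> loss) and
# backward (gradients), but training only updates the network.

vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab, (16,))        # stand-in for training text
inputs, targets = tokens[:-1], tokens[1:]      # predict the next token

logits = model(inputs)                         # the network's outputs
loss = F.cross_entropy(logits, targets)        # the goal function, fixed
loss.backward()                                # gradients flow backwards
opt.step()                                     # ...updating the network,
                                               # never the goal function
```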
What I mean by a goal function is something that, if changed without changing anything else, will cause a general-purpose AI to do something different. What I don’t mean is the vacuous sense in which a toaster has the goal of making toast. A toaster is not going to suddenly start boiling water, but that is because of its limitations, not because of a goal.
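A minimal sketch of that definition (toy planner, invented goals): swap the goal function, change nothing else, and the behavior changes.

```python
# Sketch of the definition: the "goal function" is the one component
# which, if swapped while everything else stays fixed, changes what the
# system does. The planner below is identical in both cases.

def plan(actions, goal_fn):
    return max(actions, key=goal_fn)           # same machinery either way

actions = [-3, 1, 4]
goal_a = lambda x: x                           # goal A: prefer large values
goal_b = lambda x: -abs(x)                     # goal B: prefer values near 0

print(plan(actions, goal_a))  # 4 -- behavior under goal A
print(plan(actions, goal_b))  # 1 -- same planner, different goal, different behavior
```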
My point here is that GPT-N is not going to suddenly decide “hey, I’m going to use cos(product of all the network’s outputs) as my new goal function”.
The idea isn’t that goal functions don’t set goals (where they really exist). The idea is that if you have a very specific GF that’s specified in plain English, it’s perverse to instantiate it using a poorer NL module than is otherwise available.