Mostly just here to say “I agree”, especially regarding
Similarly, I think Eliezer’s reasoning about convergent incentives and the deep nature of consequentialism is too sloppy to get to correct conclusions and the resulting assertions are wildly overconfident.
and
I think that if you really dive into any of these key points you will quickly reach details where Eliezer cannot easily defend his view to a smart disinterested audience.
A lot of EY’s points follow naturally if you think that the first AGI will be a recursively self-improving, maximally Bayesian reinforcement learner that fooms into existence as soon as someone invents the right metaheuristic. In this world, we should be really worried about whether, e.g., corrigibility is natural in some platonic sense, or whether there is a small core to human alignment.
In Paul’s world, AGI is the result of normal engineering, just at a scale 100-1000x what OpenAI and DeepMind are doing now. In this world, it makes sense to talk about building large coalitions and really understanding what’s going on in the guts of existing Deep Learning algorithms.
I think Paul’s timelines (~15% on singularity by 2030 and ~40% on singularity by 2040) are a little conservative. Personally I estimate >50% by 2030. But Paul’s story of how AGI gets built makes a lot more sense than EY’s, and this goes a long way toward explaining why I think the world is less doomed than EY does, and why we should focus less on a small team of people performing a Pivotal Act and more on coalition-building and understanding what’s going on inside existing Deep Learning systems.
I’ve heard a few times that AI experts both 1) admit we don’t know much about what goes on inside these systems, even as they stand today, and 2) expect to extend more trust to the AI even as capabilities increase (most recently from Ben Goertzel).
I’m curious to know whether you expect explainability to increase in correlation with capability. Or can we use Ben’s analogy: ‘I expect my dog to trust me, both because I’m that much smarter and because I have a track record of providing food/water for him’?
thanks!
Eugene
I’m not personally on board with
2) expect to extend more trust to the AI even as capabilities increase
The more capable an AI is, the more paranoid we should be about it. GPT-2 was bad enough that you could basically give it to anyone who wanted it. GPT-3 isn’t “dangerous”, but you should at least be making sure it isn’t being used for mass misinformation campaigns or something like that. Assuming GPT-4 is human-level, it should be boxed/airgapped and only used by professionals with a clear plan to make sure it doesn’t produce dangerous outputs. And if GPT-5 is super-intelligent (> all humans combined), even a text terminal is probably too dangerous until we’ve solved the alignment problem. The only use case where I would even consider using an unaligned GPT-5 is one where you could produce a formal proof that its outputs were what you wanted.
I’m curious to know whether you expect explainability to increase in correlation with capability. Or can we use Ben’s analogy: ‘I expect my dog to trust me, both because I’m that much smarter and because I have a track record of providing food/water for him’?
Don’t agree with this at all. Explainability/alignment/trustworthiness are all pretty much orthogonal to intelligence.
Thank you—btw before I try responding to other points, here’s the Ben G vid to which I’m referring. Starting around 52m, for a few minutes, for that particular part anyway:
Listening to the context there, it sounds like what Ben is saying is that once we’ve solved the alignment problem, we will eventually trust the aligned AI to make decisions we don’t understand. That is a very different claim from saying that the AI is trustworthy merely because it is intelligent and hasn’t done anything harmful so far.
I also don’t fully understand why he thinks it will be possible to use formal proof to align human-level AI, but not superhuman AI. He suggests there is a counting argument, but it seems that if I could write a formal proof for “won’t murder all humans” that works on a human-level AGI, that proof would be equally valid for a superhuman AGI (see the sketch below). The difficulty is that formal mathematical proof doesn’t really work for fuzzily defined words like “human” and “murder”, not that super-intelligence would transform those (assuming they did have a clean mathematical representation). This is why I’m pessimistic about formal proof as an alignment strategy generally.
In fact, if it turned out that human value had a simple-to-define core, then the Alignment problem would be much easier than most experts expect.
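To illustrate the capability-independence point, here is a purely hypothetical Lean sketch (the names Agent, satisfies_spec, and murders_humans are invented for illustration; nothing here comes from Ben’s talk or the comments above). If the safety property is stated as a theorem quantified over all agents, nothing in the proof depends on how capable the agent is; all of the real difficulty is pushed into formalizing the predicates themselves.

```lean
-- Hypothetical sketch: a "safety theorem" quantified over every agent.
-- All names are made up for illustration.

structure Agent where
  capability : Nat   -- crude stand-in for "how smart the agent is"

-- These predicates stand in for the fuzzy notions ("human", "murder")
-- that resist clean formalization; defining them is the hard part.
axiom satisfies_spec : Agent → Prop   -- "the agent meets our spec"
axiom murders_humans : Agent → Prop   -- the behaviour we want ruled out

-- Suppose we somehow proved that meeting the spec rules out the bad outcome.
axiom spec_is_safe : ∀ a : Agent, satisfies_spec a → ¬ murders_humans a

-- The argument never mentions `capability`, so it applies at any
-- capability level, superhuman included.
theorem safe_at_any_capability (a : Agent) (h : satisfies_spec a) :
    ¬ murders_humans a :=
  spec_is_safe a h
```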
OK thanks, I guess I missed him differentiating between ‘solve alignment first, then trust’ and ‘trust first, given enough intelligence’. Although I think one issue with having a proof is that we (or a million monkeys, to paraphrase him) still won’t understand the decisions of the AGI...? i.e. we’ll be asked to trust the prior proof instead of understanding the logic behind each future decision/step the AGI takes? That also bothers me, because what are the tokens that comprise a “step”? Does it stop 1,000 times to check with us that we’re comfortable with, or understand, its next move?
However, since it seems we can’t explain many of the decisions of our current ANI, how do we expect to understand future ones? He mentions that we may be able to, but only by becoming transhuman.
:)
Exactly what I’m thinking too.