2) we expect to extend more trust to the AI even as capabilities increase
The more capable an AI is, the more paranoid we should be about it. GPT-2 was limited enough that you could basically give it to anyone who wanted it. GPT-3 isn’t “dangerous”, but you should at least be making sure it isn’t being used for mass misinformation campaigns or something like that. Assuming GPT-4 is human-level, it should be boxed/air-gapped and only used by professionals with a clear plan for making sure it doesn’t produce dangerous outputs. And if GPT-5 is super-intelligent (> all humans combined), even a text terminal is probably too dangerous until we’ve solved the alignment problem. The only use case where I would even consider using an unaligned GPT-5 is one where you could produce a formal proof that its outputs were what you wanted.
I’m curious to know whether you expect explainability to increase in correlation with capability? i.e., can we use Ben’s analogy that ‘I expect my dog to trust me, both because I’m that much smarter and because I have a track record of providing food/water for him’?
Don’t agree with this at all. Explainability/alignment/trustworthiness are all pretty much orthogonal to intelligence.
Thank you. Btw, before I try responding to other points, here’s the Ben G vid to which I’m referring; the relevant part starts around 52m and runs for a few minutes:
Listening to the context there, it sounds like what Ben is saying is that once we’ve solved the alignment problem, we will eventually trust the aligned AI to make decisions we don’t understand. That is a very different claim from saying the AI is trustworthy merely because it is intelligent and hasn’t done anything harmful so far.
I also don’t fully understand why he thinks it will be possible to use formal proof to align human-level AI but not superhuman AI. He suggests there is a counting argument, but it seems that if I could write a formal proof of “won’t murder all humans” that works for a human-level AGI, the same proof would be equally valid for a superhuman AGI. The difficulty is that formal mathematical proof doesn’t really work for fuzzily defined words like “human” and “murder”, not that super-intelligence would somehow invalidate those definitions (assuming they had a clean mathematical representation in the first place). This is why I’m pessimistic about formal proof as an alignment strategy in general.
In fact, if it turned out that human values had a simple-to-define core, then the alignment problem would be much easier than most experts expect.
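To make that concrete, here is a minimal sketch in Lean of what such a safety statement might look like. The names (Prompt, Output, HarmsHuman, Safe) are placeholders of mine, not anything from the video or this thread; the point is just that the statement quantifies over the policy’s outputs and never mentions how capable the policy is, so all of the difficulty hides in the definition of HarmsHuman.

    -- Hypothetical sketch; every name here is a placeholder, not a real spec.
    axiom Prompt : Type
    axiom Output : Type
    axiom HarmsHuman : Output → Prop  -- the fuzzy predicate nobody knows how to formalize

    -- Safety is a property of the policy's input/output behaviour;
    -- nothing in the statement refers to how intelligent the policy is.
    def Safe (policy : Prompt → Output) : Prop :=
      ∀ p : Prompt, ¬ HarmsHuman (policy p)

If HarmsHuman did have a clean definition, the same proof obligation would apply to a human-level policy and a superhuman one alike.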
OK thanks, I guess I missed him differentiating between ‘solve alignment first, then trust’ versus ‘trust first, given enough intelligence’. Although I think one issue with having a proof is that we (or a million monkeys, to paraphrase him) still won’t understand the decisions of the AGI...? i.e., we’ll be asked to trust the prior proof instead of understanding the logic behind each future decision/step which the AGI takes? That also bothers me, because what are the tokens which comprise a “step”? Does it stop 1,000 times to check with us that we’re comfortable with, or understand, its next move?
However, since it seems we can’t explain many of the decisions of our current ANI, how do we expect to understand future ones? He mentions that we may be able to, but only by becoming trans-human.
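As an aside, here is a purely hypothetical sketch (in Python) of what “stopping to check with us” could look like in practice. The function, the model.generate call, and the step_tokens parameter are all assumptions of mine rather than anything proposed in the video or this thread; choosing step_tokens is exactly the “what counts as a step?” question raised above.

    # Hypothetical sketch of step-by-step human approval. "model.generate" is an
    # assumed interface, and "step_tokens" is the unresolved choice of how many
    # tokens make up one "step".
    def run_with_approval(model, prompt, step_tokens=50, max_steps=1000):
        transcript = prompt
        for _ in range(max_steps):
            chunk = model.generate(transcript, max_new_tokens=step_tokens)
            print(chunk)
            if input("Approve this step? [y/n] ").strip().lower() != "y":
                return transcript  # halt as soon as the human is not comfortable
            transcript += chunk
        return transcript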
I’m not personally on board with this, from above: “The more capable an AI is, the more paranoid we should be about it...”
:)
Exactly what I’m thinking too.