Listening to the context there, it sounds like what Ben is saying is that once we’ve solved the alignment problem, we will eventually trust the aligned AI to make decisions we don’t understand. That is a very different claim from saying the AI is trustworthy merely because it is intelligent and hasn’t done anything harmful so far.
I also don’t fully understand why he thinks it will be possible to use formal proof to align human-level AI but not superhuman AI. He suggests there is a counting argument, but it seems that if I could write a formal proof of “won’t murder all humans” that works for a human-level AGI, that proof would be equally valid for a superhuman AGI. The difficulty is that formal mathematical proof doesn’t really work for fuzzily defined words like “human” and “murder”, not that superintelligence would somehow invalidate those concepts (assuming they did have a clean mathematical representation). This is why I’m pessimistic about formal proof as an alignment strategy in general.
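To make that concrete, here is a minimal Lean sketch (entirely my own toy illustration, not anything from the interview; `Agent`, `SafeDesign`, and `NeverMurdersAllHumans` are invented placeholders). If the safety argument rests on a verified property of how the agent is constructed rather than on a capability bound, the capability level never appears in the proof at all, and every bit of the hard work is hidden inside the undefined predicates.

```lean
-- Toy sketch: a hypothetical agent, tagged with some capability level.
structure Agent where
  capability : Nat

-- The fuzzy property we actually care about. All of the real difficulty
-- lives in giving this a definition; here it is only an opaque placeholder.
axiom NeverMurdersAllHumans : Agent → Prop

-- Suppose the safety argument rests on a verified property of the agent's
-- construction (also left opaque), rather than on its capability level.
axiom SafeDesign : Agent → Prop
axiom design_implies_safety :
  ∀ a : Agent, SafeDesign a → NeverMurdersAllHumans a

-- Then the proof never mentions capability: it applies verbatim to a
-- human-level agent and to a superhuman one.
theorem safe_at_any_capability (a : Agent) (h : SafeDesign a) :
    NeverMurdersAllHumans a :=
  design_implies_safety a h
```

The catch, of course, is that nobody knows how to replace those axioms with real definitions, which is exactly the fuzziness problem above.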
In fact, if it turned out that human value had a simple-to-define core, then the alignment problem would be much easier than most experts expect.
OK thanks, I guess I missed him differentiating between ‘solve alignment first, then trust’ and ‘trust first, given enough intelligence’. Although I think one issue with having a proof is that we (or a million monkeys, to paraphrase him) still won’t understand the decisions of the AGI; i.e. we’ll be asked to trust the prior proof instead of understanding the logic behind each future decision or step the AGI takes. That also bothers me, because what are the tokens that comprise a “step”? Does it stop 1,000 times to check with us that we’re comfortable with, or understand, its next move?
However, since it seems we can’t explain many of the decisions of our current ANI, how do we expect to understand those of future systems? He mentions that we may be able to, but only by becoming transhuman.
:)
Exactly what I’m thinking too.