The total absence of obvious output of this kind from the rest of the “AI safety” field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the “AI safety” field outside myself is currently giving you.
While this may be a fair criticism, I feel like someone ought to point out that the vast majority of AI safety output (at least that I see on LW) isn’t trying to do anything like “sketch a probability distribution over the dynamics of an AI project that is nearing AGI”. This includes all technical MIRI papers I’m familiar with.
Perhaps we should be doing this (though isn’t that more a matter of AI forecasting/strategy than of alignment? Still AI safety, of course), but then the failure isn’t “no-one has enough security mindset” but rather something like “no-one has the social courage to tackle the problems that are actually important”. (This would be closer to EY’s critique in the Discussion on AGI interventions post.)
isn’t trying to do anything like “sketch a probability distribution over the dynamics of an AI project that is nearing AGI”. This includes all technical MIRI papers I’m familiar with.
I think that, from a mainstream AI safety perspective, this specific scenario sketch is a case where we’ve already failed—i.e. we’ve invented a useless corrigibility intervention that we confidently but wrongly believe is scalable.
And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.
If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.
Most AI safety researchers just don’t agree with Eliezer’s claim that every corrigibility intervention we’re likely to find will suddenly and invisibly fail as intelligence increases, no matter how well it has been validated in low-capability regimes and how carefully you try to scale up. This is because they disagree with, or haven’t heard, Eliezer’s arguments about consequentialism being a super-strong attractor.
So they’d think the ‘die with the most dignity’ interventions would just work, while the ‘die with no dignity’ interventions are risky, and would quite reasonably push for the former (since it’s far from clear we’ll take the ‘dignified’ option by default): trying corrigibility interventions at low levels of intelligence and testing the AI on validation sets to see whether it plots to kill them, while scaling up.
They might be wrong about this working, but if so, the mistake isn’t a lack of security mindset, i.e. failing to see that an AI trying to kill you would just alter its own cognition to cheat its way past the tests. Rather, their mistake is not expecting the corrigibility interventions they presumably trust to suddenly break in a way that means no amount of testing at lower capability levels gives you a useful safety guarantee.
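To make the ‘validation set’ worry concrete, here is a minimal toy sketch in Python. Everything in it is a made-up assumption for illustration (the capability numbers, the log-uniform spread of breaking points), not anyone’s actual model: if each candidate corrigibility trick silently breaks above some capability level, then selecting tricks because they pass at validation-level capability tells you little about whether they survive at deployment-level capability.

```python
# Toy sketch: selecting corrigibility tricks by whether they pass a
# validation-capability check, then asking how many would still hold at
# deployment capability. All numbers are hypothetical.
import random

random.seed(0)

VALIDATION_CAP, DEPLOY_CAP = 3.0, 10.0   # hypothetical capability levels
N_CANDIDATE_TRICKS = 1000

def sample_breaking_point():
    # Hypothetical assumption: each trick keeps working up to some capability
    # level and silently fails beyond it; breaking points are spread
    # log-uniformly between 0.1 and 100.
    return 10 ** random.uniform(-1, 2)

breaking_points = [sample_breaking_point() for _ in range(N_CANDIDATE_TRICKS)]

# Tricks that show no visible plotting at validation-level capability.
passed_validation = [b for b in breaking_points if b > VALIDATION_CAP]

# Of those, the ones that would actually still hold at deployment capability.
survive_deploy = [b for b in passed_validation if b > DEPLOY_CAP]

print(f"pass validation:         {len(passed_validation)} / {N_CANDIDATE_TRICKS}")
print(f"also survive deployment: {len(survive_deploy)} / {len(passed_validation)}")
```

With these made-up numbers, roughly a third of the tricks that look fine on the validation set still break somewhere before deployment capability, which is the sense in which passing validation buys you little on its own.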
I think it’s a shame Eliezer didn’t pose the ‘validation set’ question first before answering it himself, because I think that if you got rid of the difference in underlying assumptions—i.e. asked an alignment researcher, “Assume there’s a strong chance your corrigibility intervention won’t work once you scale up and the AGI might start plotting against you; you’re going to try these transparency/validation schemes on the AGI to check whether it’s safe. How could they go wrong, and is this a good idea?”—they’d give basically the same answer: if you try this, you’re probably going to die.
You could still reasonably say, “even if the AI safety community thinks it’s not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn’t we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?”
no-one has the social courage to tackle the problems that are actually important
I would be very surprised if this were true. I personally don’t feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.
I would guess that if people aren’t tackling Hard Problems enough, it’s not because they lack social courage, but because 1) they aren’t running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they’re wrong about what problems are Hard Problems. My money’s mostly on (1), with a bit of (2).
You could still reasonably say, “even if the AI safety community thinks it’s not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn’t we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?”
My answer would be yes.
I think we gotta get the message out that consequentialism is a super-strong attractor.