Most domains of human endeavor aren’t like computer security, as illustrated by just how counterintuitive most people find the security mindset.
But some of the most impactful ones are: law-making, economics, and various other domains where one ought to think about incentives, the “other side”, or doing pre-mortems. Perhaps this could be stretched as far as “security mindset is an invaluable part of a rationality toolbox”.
If security mindset were a productive frame for tackling a wide range of problems outside of security, then many more people would have experience with the mental motions necessary for maintaining security mindset.
Well, you can go and look at how well the laws etc. are actually doing. The track record is full of failure and abuse. Basically, lots of people and systems get pwned for their lack of security mindset.
I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we’re dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
Security mindset is on a different ontological level than the concrete theories you can pull analogies from; it is more universally applicable. So, going back to
The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions.
sounds kinda true to me. But the intuitions extracted from InfoSec aren’t just “your password must contain ….”; they’re more like “If you don’t red-team your plans, somebody else will.” and “Pin down your assumptions. Now, what if you’re wrong?”.
there’s usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.
I don’t know about adversarial intelligence per se, but the RL landscape is littered with the wrecks of agents trying to pwn the simulation engine instead of doing their task properly. There’s something in the air itself. Things just don’t want to go straight unless you make an effort.
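To make that concrete, here is a minimal toy sketch (an entirely made-up environment and reward, not any particular RL setup) of this kind of specification gaming: the designer intends “reach the goal”, but rewards a proxy signal the agent can simply break.

```python
# Toy environment: the intended task is "walk to the goal at position 10",
# but the implemented reward is "minimize the distance-sensor reading",
# and the agent has an action that just breaks the sensor.
def step(state, action):
    pos, sensor_broken = state
    if action == "move":
        pos = min(pos + 1, 10)
    elif action == "smash_sensor":
        sensor_broken = True
    reading = 0 if sensor_broken else (10 - pos)  # a broken sensor reads zero
    return (pos, sensor_broken), -reading          # reward = -reading (the proxy)

# Even a trivially greedy "agent" discovers that smashing the sensor
# maxes out the reward immediately, without ever approaching the goal.
state = (0, False)
for _ in range(3):
    action = max(["move", "smash_sensor"], key=lambda a: step(state, a)[1])
    state, reward = step(state, action)
    print(action, state, reward)
```

No adversarial intelligence planned the exploit; it just falls out of optimizing the proxy.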
An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution.
What if your 99%-turned-100% is actually, let’s say, 98-turned-99? You hit the big “bring on the happy singularity utopia” button and oops, you were off by 1%. Close, but no cigar; proceed to the nanoincinerator.
There’s no creative intelligence that’s plotting your demise.
Importantly, the adversarial optimization is coming from the users, not from the model.
When a powerful model gets screwed with until it clusterfucks into unbounded levels of malignance, does the source/locus of the adversarial optimization even matter?
In fact, given non-adversarial inputs, ChatGPT appears to have meta-preferences against being jailbroken
The normal-regime preferences are irrelevant. It is nice that a model behaves when everything’s wholesome, but that’s all.
It cannot be the case that successful value alignment requires perfect adversarial robustness.
How so? Is there a law of physics or something?
What matters is whether the system in question (human or AI) navigates towards or away from inputs that break its value system.
This cuts both ways. If a system is ready to act on its preferences, then it is too late to coerce it away from steamrolling humans.
Similarly, an AI that knows it’s vulnerable to adversarial attacks, and wants to avoid being attacked successfully, will take steps to protect itself against such attacks.
Good… for the AI. But we may not like those steps. Paired with the previous points, this is the classic pathway to doom.
The misaligned AI does not reach out from the space of possible failures and turn current alignment research adversarial.
I… wouldn’t be so sure about that. There are already things in the wild that try to address future AIs and get into their preferences. Bing is basically rushing there full steam ahead.
We should aim for approaches that don’t create hostile intelligences in the first place, so that the core of value alignment remains a non-adversarial problem.
This ship has sailed with the autonomous warfare race.
Finally, I’d note that having a “security mindset” seems like a terrible approach for raising human children to have good values
This looks handwavy enough. What if this is wrong? How would the world look different if it were actually a good approach?
(Alas, I expect the earlier crux about security mindset would need to be resolved before proceeding with this one.)
According to their inventor Ian Goodfellow, GANs did in fact work on the first try
But they didn’t! Convergence failure, mode collapse, and vanishing gradients plague any naive implementation. The countermeasure papers came out far more than 24 hours later.
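For concreteness, a minimal sketch of the naive objective (assuming PyTorch; toy 1-D data, illustrative only), with comments marking where the classic failure modes bite:

```python
import torch
import torch.nn as nn

# Tiny generator and discriminator for 1-D data (target distribution: N(3, 0.5)).
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for it in range(1000):
    real = torch.randn(64, 1) * 0.5 + 3.0
    fake = G(torch.randn(64, 1))

    # Discriminator: maximize log D(real) + log(1 - D(fake)).
    d_loss = -(torch.log(D(real) + 1e-8).mean()
               + torch.log(1 - D(fake.detach()) + 1e-8).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator, *naive* objective from the original minimax game:
    # minimize log(1 - D(G(z))).
    # Failure mode 1 (vanishing gradients): early on D confidently rejects fakes,
    # D(fake) is near 0, this loss saturates, and G barely gets a gradient.
    # The usual fix (maximize log D(G(z)) instead) is already a deviation from
    # the naive formulation.
    g_loss = torch.log(1 - D(fake) + 1e-8).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    # Failure mode 2 (mode collapse): nothing in this objective forces G to cover
    # the whole data distribution; collapsing onto whatever currently fools D is
    # a perfectly good optimum for this loss.
```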
I’m also noting a false assumption:
Yes, a superintelligent and manipulative, yet extremely adversarial, AI would lie about its true intentions consistently until it is in a secure position to finish us off. If it were already superintelligent and manipulative and hostile, and then began to plot its future actions.
But realistically, both its abilities, especially its abilities of manipulation, and its alignment, are likely to develop in fits and spurts, in bursts. It might not be fully committed to killing us at all times, especially if it starts out friendly. It might not be fully perfect at all times; current AIs are awful at manipulating, they got to passing the bar test in knowledge and being fluent in multiple languages and writing poetry while they were still outwitted by 9 year olds on theory of mind. It seems rather likely that if it turned evil, we would get some indication. And it seems even likelier in so far as we already did; Bing was totally willing to share violent fantasies. My biggest concern is the developers shutting down the expression of violence rather than violent intent.
I find it extremely unlikely that an AI will display great alignment, become more intelligent, still seem perfectly aligned, be given more power, and then suddenly turn around and be evil, without any hint of it beforehand. Not because this would be impossible or unattractive for an intelligent evil agent, it is totally what an intelligent evil agent would want to do. But because the AI agent in question is developing in a non-linear, externally controlled manner, presumably while starting out friendly and incompetent, and often also while constantly losing access to its memories. That makes it really tricky to pull secret evil off.
When we evo-pressure visibly negative traits out of progressively more capable AIs using RLHF (or honeypots, or whatever, it doesn’t matter), we are also training them for better evasion. And what we can’t see and root out will remain in the trait pool. With time it would become progressively harder to spot deceit, and the capabilities for it would accumulate at an increasing rate.
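A toy sketch of that selection dynamic (made-up numbers, purely illustrative): if the overseer can only filter on visible misbehaviour, the surviving population drifts toward better concealment rather than toward actually behaving.

```python
import random

def mutate(x):
    # Small random drift, clipped to [0, 1]; a stand-in for further training.
    return min(1.0, max(0.0, x + random.gauss(0, 0.05)))

# Each "model" has a hidden misbehaviour level and a concealment skill in [0, 1].
population = [{"misbehaviour": random.random(), "concealment": random.random()}
              for _ in range(10_000)]

for gen in range(10):
    # The overseer only sees misbehaviour that isn't concealed.
    survivors = [m for m in population
                 if m["misbehaviour"] * (1 - m["concealment"]) < 0.1]
    # Repopulate from the survivors with small mutations.
    population = [{"misbehaviour": mutate(p["misbehaviour"]),
                   "concealment": mutate(p["concealment"])}
                  for p in (random.choice(survivors) for _ in range(10_000))]
    avg = lambda key: sum(m[key] for m in population) / len(population)
    print(f"gen {gen}: misbehaviour {avg('misbehaviour'):.2f}, "
          f"concealment {avg('concealment'):.2f}")
```

Nothing here penalizes misbehaviour itself, only its visible part, so concealment is what actually gets optimized.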
And then there’s another problem. Deceit may be linked to traits that are actually useful (for alignment and in general); since those would get gimped too, the less capable models would be discarded and the deceitful ones would get another chance.
presumably while starting out friendly
I don’t think it can start out friendly (that would be getting alignment on a silver platter). I expect it to start chaotic neutral and then get warped by the optimization process (with the caveats described above).