you only start handing out status points after someone has successfully demonstrated the security failure
Maybe you’re right: we may need to deploy an AI system that demonstrates the potential to kill tens of millions of people before anyone really takes AI risk seriously. The AI equivalent of Trinity (https://en.wikipedia.org/wiki/Trinity_(nuclear_test)).
It’s not just about “being taken seriously”, although that’s a nice bonus; it’s also about building a shared understanding of what makes programs secure vs. insecure. You need a method of touching grass so that researchers have some idea of whether or not they’re making progress on the real issues.
We already can’t make MNIST digit recognizers secure against adversarial attacks. We don’t know how to prevent prompt injection. Convnets are vulnerable to adversarial attacks. RL agents that play Go at superhuman levels are vulnerable to simple strategies that exploit gaps in their cognition.
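(For concreteness, here is a minimal sketch of the kind of adversarial attack referred to above: an FGSM-style perturbation against a small image classifier. The model, tensors, and epsilon are placeholder assumptions rather than anything from this thread; against a trained MNIST classifier, perturbations of this form routinely flip the predicted digit while remaining nearly invisible to a human.)

```python
# Minimal FGSM-style adversarial perturbation against a small (untrained,
# placeholder) MNIST-shaped classifier. With a trained model in its place,
# a perturbation like this usually changes the predicted class.
import torch
import torch.nn as nn

model = nn.Sequential(                 # stand-in for a trained digit classifier
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
loss_fn = nn.CrossEntropyLoss()

def fgsm(image, label, eps=0.1):
    """Return a copy of `image` nudged in the direction that increases the loss."""
    image = image.clone().requires_grad_(True)
    loss_fn(model(image), label).backward()
    return (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()

x = torch.rand(1, 1, 28, 28)           # placeholder "digit" image
y = torch.tensor([7])                  # placeholder true label
x_adv = fgsm(x, y)
print(model(x).argmax(1).item(), model(x_adv).argmax(1).item())
```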
No, there’s plenty of evidence that we can’t make ML systems robust. What is lacking is “concrete” evidence that this will result in blood and dead bodies.
We already can’t make MNIST digit recognizers secure against adversarial attacks. We don’t know how to prevent prompt injection. Convnets are vulnerable to adversarial attacks. RL agents that play Go at superhuman levels are vulnerable to simple strategies that exploit gaps in their cognition.
None of those things are examples of misalignment except arguably prompt injection, which seems like it’s being solved by OpenAI with ordinary engineering.
To me the security mindset seems inapplicable because in computer science, programs are rigid systems with narrow targets. AI is not very rigid, and the target, i.e. an aligned mind, is not necessarily narrow.
That rigidity is what makes computer security so easy, relative to AGI security.
No, the rigidity is what makes a system error-prone, i.e. brittle. If you don’t specify the solution exactly, the machine won’t solve the problem. Classic computer programs can’t generalize.
The OP makes the point that you can double a model’s size and it will still work well, but if you double a computer program’s binary size with unused lines of code you can get all sorts of weird errors, even if none of that extra code is ever executed.
An analogy is trying to write a symbolic logic program to emulate an LLM (i.e. with only if statements and for loops), or trying to make a self-driving car with Boolean logic.
If I flip one single bit in a computer program, it will probably fail catastrophically and crash. However, removing random weights won’t do much to an LLM.
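(A rough sketch of the weight-removal half of that claim. The tiny feed-forward model below is a stand-in assumption so the code is self-contained, not an LLM; with trained networks, large or small, zeroing a modest random fraction of weights typically degrades outputs gracefully rather than breaking them outright.)

```python
# Zero out a random ~5% of a network's weights and measure how much its
# outputs move. The tiny MLP is a stand-in so the snippet is self-contained;
# trained networks (LLMs included) typically degrade gracefully under this
# kind of random damage rather than crashing or collapsing.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.randn(32, 64)
baseline = model(x)

with torch.no_grad():
    for p in model.parameters():
        keep = (torch.rand_like(p) > 0.05).float()   # drop ~5% of entries at random
        p.mul_(keep)

drift = (baseline - model(x)).abs().mean().item()
print(f"mean output drift after zeroing ~5% of weights: {drift:.4f}")
```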
A little tangent on flipping a bit:
Flipping a bit in the actual binary itself (the thing the computer reads to run the program) will probably cause the program to access memory it wasn’t supposed to and immediately crash.
Changing a letter in a computer program that humans write will very likely cause the program not to compile.
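(A toy illustration of the bit-flip point, under the assumption that corrupting a Python function’s compiled bytecode is an acceptable stand-in for corrupting a native binary. The function and the choice of which bit to flip are arbitrary; running corrupted bytecode is undefined behavior, so expect an exception, nonsense output, or possibly an interpreter crash.)

```python
# Flip one bit in a Python function's compiled bytecode and try to run the
# result. This is only an analogy for flipping a bit in a native binary:
# the corrupted code is undefined behavior for the interpreter, so it may
# raise, return garbage, or crash the process outright.
import types

def add(a, b):
    return a + b

code = add.__code__
corrupted = bytearray(code.co_code)
corrupted[0] ^= 0x01                             # flip the low bit of the first opcode byte
broken = types.FunctionType(code.replace(co_code=bytes(corrupted)), globals())

try:
    print("corrupted result:", broken(2, 3))     # if it runs at all, trust nothing it returns
except Exception as exc:                         # SystemError("unknown opcode") is common
    print("corrupted program failed:", repr(exc))
```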
Yep, these are the important parts, and neural networks are much more robust than that. They have extreme robustness compared to the systems studied in a lot of other fields, which is why I’m skeptical of applying the security mindset: it would predict false things.
The non-rigidity of ChatGPT and its ilk does not make them less error-prone. Indeed, ChatGPT text is usually full of errors. But the errors are just as non-rigid. So are the means, if they can be found, of fixing them. ChatGPT output has to be read with attention to see its emptiness. None of this has anything to do with security mindset, as I understand the term.
The point is that if it were like computer security, or even computer engineering, those errors would completely destroy ChatGPT’s intelligence and make it as useless as a computer running random code. This is just one of the observations that make me skeptical of applying the security mindset: ML/AI, and its subfield ML/AI alignment, is a strange enough field that I wouldn’t port over intuitions from other fields.
ML/AI alignment is like quantum mechanics, in which you need to leave your intuitions at the door, and unfortunately this makes public outreach likely net-negative.
At this point it is not clear to me what you mean by security mindset. I understand by it what Bruce Schneier described in the article I linked, and what Eliezer describes here (which cites and quotes from Bruce Schneier). You have cited QuintinPope, who also cites the Eliezer article, but gets from it this concept of “security mindset”: “The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions”. From this and his further words about the concept, he seems to mean something like “programming mindset”, i.e. good practice in software engineering. Only if I read both you and him as using “security mindset” to mean that can I make sense of the way you both use the term.
But that is simply not what “security mindset” means. Recall that Schneier’s article began with the example of a company selling ant farms by mail order, nothing to do with software. After several more examples, only one of which concerns computers, he gives his own short characterisation of the concept that he is talking about:
the security mindset involves thinking about how things can be made to fail. It involves thinking like an attacker, an adversary or a criminal. You don’t have to exploit the vulnerabilities you find, but if you don’t see the world that way, you’ll never notice most security problems.
Later on he describes its opposite:
The designers are so busy making these systems work that they don’t stop to notice how they might fail or be made to fail, and then how those failures might be exploited.
That is what Eliezer is talking about, when he is talking about security mindset.
Yes, prompting ChatGPT is not like writing a software library such as PyTorch. That does not make getting ChatGPT to do what you want and only what you want any easier or safer. In fact, it is much more difficult. Look at all the jailbreaks for ChatGPT and other chatbots, where they have been made to say things they were intended not to say, and answer questions they were intended not to answer.
My issue with the security mindset is that there’s a selection effect/bias that causes people to notice the failures of security and not its successes, even if the true evidence for success is massively larger than the evidence for failure.
Here’s a quote from lc’s post on PoC or GTFO as a counter to alignment wordcelism, on why the security industry has massive issues with people claiming security failures that don’t or can’t happen:
Even if you’re right that an attack vector is unimportant and probably won’t lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you’re wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.
And this is why, in general, I dislike the security mindset: the incentives push people to report failures or bad events even when they aren’t much of a concern.
Also, the stuff that computer security people do largely doesn’t need to be done in ML/AI, which is another reason I’m skeptical of the security mindset.
These are parochial matters within the computer security community, and do not bear on the hazards of AGI.
They do matter, since they imply a selection effect where people will share the evidence for doom and not notice the evidence for not-doom. This matters because the real chance of doom may be much lower, in principle arbitrarily low, while LWers and AI safety/governance organizations have higher probabilities of doom.
Combined with the more standard bias toward negative news being selected for, this is one piece of why I think AI doom is very unlikely. It is just one piece, not my entire argument.
And I think this has already happened; cf. the entire inner misalignment/optimization daemon situation, which was tested twice: once showing a confirmed break, and once by Ulisse Mini, where in a more realistic setting the optimization daemon/inner misalignment went away. Very little was shared about that second result, compared to the original, which almost certainly got more views.