Citation needed? The one computer security person I know who read Yudkowsky’s post said it was a good description of security mindset. POC||GTFO sounds useful and important too, but I doubt it’s the core of the concept.
Also, if the toy models, baby-AGI setups like AutoGPT, and historical examples we’ve provided so far don’t meet your standards for “example of the thing you’re maybe worried about” with respect to AGI risk (and you think that we should GTFO until we have an example that meets your standards), then your standards are way too high.
If instead POC||GTFO applied to AGI risk means “we should try really hard to get concrete, use formal toy models when possible, create model organisms to study, etc.” then we are already doing that and have been.
On POCs for misalignment, specifically goal misgeneralization, there are pretty fundamental differences between what has been shown so far and what was predicted. One of them is that in the demonstrations, the train and test behavior across environments are similar or the same, while in the goal misgeneralization speculations the train and test behavior are wildly different (see the toy sketch below):
Rohin Shah has a comment on why most POCs aren’t that great here:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment#P3phaBxvzX7KTyhf5
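To make the contrast concrete, here is a minimal toy sketch of my own (a CoinRun-style 1-D world; the setup and names are purely illustrative, not taken from the linked comment). The intended goal is “reach the coin,” but every training environment puts the coin at the far right, so the learned proxy policy “always go right” looks aligned during training and behaves the same way at test time even though the outcome changes:

```python
# Purely illustrative toy (my own, not from the linked post): a 1-D CoinRun-style
# world. The intended goal is "reach the coin"; training always puts the coin at
# the far right, so a policy that merely "goes right" looks aligned in training.

def run_episode(policy, start, coin_position, world_length=10, max_steps=10):
    """Run one episode; return (trajectory, whether the coin was reached)."""
    position = start
    trajectory = [position]
    for _ in range(max_steps):
        position = max(0, min(world_length, position + policy(position)))
        trajectory.append(position)
        if position == coin_position:
            return trajectory, True
    return trajectory, False

go_right = lambda pos: 1  # the learned proxy policy: always move right

# Training distribution: coin at the rightmost cell, so "go right" gets the coin.
print(run_episode(go_right, start=0, coin_position=10))
# Test distribution: coin moved to the left of the start (distribution shift).
print(run_episode(go_right, start=5, coin_position=2))

# The *behavior* (step right every turn) is the same in train and test; only its
# relationship to the intended goal changes. That is the sense in which existing
# goal-misgeneralization POCs show similar train/test behavior, rather than the
# wildly different behavior posited in the speculative scenarios.
```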
Nevertheless, if you think that this isn’t good enough and that people worried about AGI risk should GTFO until they have something better, you are the one who is wrong.
I don’t think people worried about AGI risk should GTFO.
I do think we should stop giving them as much credit as we do, because you are likely to be privileging the hypothesis, and it means we shouldn’t count the POCs as vindicating the people worried about AI safety, since their evidence doesn’t really support the claim of goal misgeneralization.
I think that’s a vague enough claim that it’s basically a setup for a motte-and-bailey. “Stop giving them as much credit as we do.” Well, I think that if ‘we’ = society in general, then we should start giving them way more credit, in fact. If ‘we’ = various LWers who don’t think for themselves and just repeat what Yudkowsky says, then yes, I agree. If ‘we’ = me, then no thank you, I believe I am allocating credit appropriately; I take the point about privileging the hypothesis, but I was well aware of it already.
What this would look like in practice is the following (taken from the proposed break of optimization daemons/inner misalignment; a rough code sketch follows the steps):
Someone proposes a break of AI that threatens alignment, like optimization daemons.
We test the claim on toy AIs. Either it doesn’t work, or it does work on them and we move to the next step.
We test the alignment break in a more realistic setting, and it turns out that the perceived break goes away.
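As a purely schematic sketch (the names and structure below are my own, not something the original discussion specifies), the escalation procedure above can be thought of as a small harness that runs the same proposed failure-mode test across increasingly realistic settings and records whether the failure keeps reproducing:

```python
# Schematic harness (hypothetical names; the run_test callables stand in for
# real experiments): escalate a proposed alignment break from toy settings to
# more realistic ones and track whether the failure keeps reproducing.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Setting:
    name: str
    realism: int                    # higher = more realistic
    run_test: Callable[[], bool]    # True if the proposed failure actually shows up

def evaluate_proposed_break(settings: List[Setting]) -> List[bool]:
    results = []
    for setting in sorted(settings, key=lambda s: s.realism):
        reproduced = setting.run_test()
        results.append(reproduced)
        print(f"{setting.name}: failure {'reproduced' if reproduced else 'did not appear'}")
    return results

# Hypothetical usage: a break that shows up in the toy model but not in the
# more realistic setting.
evaluate_proposed_break([
    Setting("toy model", realism=1, run_test=lambda: True),
    Setting("more realistic setting", realism=2, run_test=lambda: False),
])
```

If the failure reproduces in the toy model but vanishes in the more realistic setting, that is exactly the case where, per the point below, credit for the prediction should be discounted.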
Now, the key point is that if a proposed break goes away or becomes harder to produce in more realistic settings, and especially if this keeps happening, we should avoid giving people credit for predicting the failure.
More generally, one issue I have is that I perceive an asymmetry between the “AI is dangerous” people and the “AI is safe” people: if people were wrong about a danger, they’ll forget or not mention the fact that they were wrong, but if they were right about a danger, even if it was much milder than predicted and some of their other predictions were wrong, people will treat them as oracles.
A quote from lc’s post on POC || GTFO culture as partial antidote to alignment wordcelism explains my thoughts on the issue better than I can:
The computer security industry happens to know this dynamic very well. No one notices the Fortune 500 company that doesn’t suffer the ransomware attack. Outside the industry, this active vs. negative bias is so prevalent that information security standards are constantly derided as “horrific” without articulating the sense in which they fail, and despite the fact that online banking works pretty well virtually all of the time. Inside the industry, vague and unverified predictions that Companies Will Have Security Incidents, or that New Tools Will Have Security Flaws, are treated much more favorably in retrospect than vague and unverified predictions that companies will mostly do fine. Even if you’re right that an attack vector is unimportant and probably won’t lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you’re wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.
Scott Alexander writes about the asymmetry in From Nostradamus To Fukuyama. Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
I do have other issues with the security mindset, but that is an important one.
Turning to this part, though, I think I might see where I disagree:
Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
It’s not just public perception; the researchers themselves are biased toward believing that danger is happening or will happen. Critically, since this bias is asymmetrical, it has more implications for doomy people than for optimistic people.
It’s why I’m a priori a bit skeptical of AI doom, and it’s also why it’s consistent to believe that the real probability of doom is very low, almost arbitrarily low, even while people think the probability of doom is quite high: you don’t pay attention to the non-doom or the things that went right, only to the things that went wrong.
The researchers are not the arguments. You are discussing the correctness of researchers.
Yes, that’s true, but I have more evidence than that, and in particular I have evidence that directly argues against the proposition of AI doom and against a lot of the common arguments for AI doom.
The researchers aren’t the arguments, but the properties of the researchers looking into the arguments, especially the ways they’re biased, do provide some evidence about certain propositions.