What this would look like in practice is the following (taken from the proposed break of optimization daemons/inner misalignment):
1. Someone proposes a break of AI that threatens alignment, like optimization daemons.
2. We test the claim on toy AIs; whether or not the break shows up there, we move to the next step.
3. We test the alignment break in a more realistic setting, and it turns out that the perceived break goes away.
Now, the key point: if a proposed break goes away or becomes harder to produce in more realistic settings, and especially if this keeps happening, we should avoid giving the people who proposed it credit for predicting the failure.
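As a minimal sketch of the kind of updating I have in mind (this toy model and every number in it are my own illustrative assumptions, not anything from the original discussion), each time a proposed break vanishes when the setting gets more realistic, credence that the break is a real problem should drop:

```python
# Toy Bayesian update: how much should credence in a proposed failure mode
# (e.g. optimization daemons) drop each time it fails to reproduce in a more
# realistic setting? All numbers here are illustrative assumptions.

def update(prior: float, p_obs_if_real: float, p_obs_if_not_real: float) -> float:
    """Posterior P(failure mode is real | observation) via Bayes' rule."""
    numerator = p_obs_if_real * prior
    denominator = numerator + p_obs_if_not_real * (1 - prior)
    return numerator / denominator

prior = 0.50                 # initial credence that the proposed break is real
p_vanish_if_real = 0.30      # chance the break still vanishes in a realistic setting even if real
p_vanish_if_not_real = 0.90  # chance it vanishes if it was never a real problem

for trial in range(1, 5):    # four successively more realistic tests where the break vanishes
    prior = update(prior, p_vanish_if_real, p_vanish_if_not_real)
    print(f"after test {trial}: P(break is real) = {prior:.3f}")
```

On these made-up numbers, credence roughly halves or worse with each disappearance, which is the sense in which repeated "it went away when we made things more realistic" results should count against crediting the original prediction.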
More generally, one issue I have is an asymmetry I perceive between the "AI is dangerous" and "AI is safe" camps: if people were wrong about a danger, they and everyone else will forget, or never reference, the fact that they were wrong; but if they were right about a danger, even if it turned out much milder than claimed and some of their other predictions were wrong, people will treat them as an oracle.
A quote from lc’s post on POC or GTFO culture as a counter to alignment wordcelism explains my thoughts on the issue better than I can:
The computer security industry happens to know this dynamic very well. No one notices the Fortune 500 company that doesn’t suffer the ransomware attack. Outside the industry, this active vs. negative bias is so prevalent that information security standards are constantly derided as “horrific” without articulating the sense in which they fail, and despite the fact that online banking works pretty well virtually all of the time. Inside the industry, vague and unverified predictions that Companies Will Have Security Incidents, or that New Tools Will Have Security Flaws, are treated much more favorably in retrospect than vague and unverified predictions that companies will mostly do fine. Even if you’re right that an attack vector is unimportant and probably won’t lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you’re wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.
Scott Alexander writes about the asymmetry in From Nostradamus To Fukuyama. Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
I do have other issues with the security mindset, but that is an important one.
Turning to this part, though, I think I might see where I disagree:
Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
It’s not just public perception: the researchers themselves are biased toward believing that danger is happening or will happen. Critically, since this bias is asymmetrical, it has more implications for doomy people than for optimistic people.
This is why I’m a priori a bit skeptical of AI doom, and it’s also why it’s consistent for the real probability of doom to be very low, almost arbitrarily low, while people believe the probability of doom is quite high: you don’t pay attention to the non-doom or the things that went right, only the things that went wrong.
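As an illustration of why those two beliefs are consistent, here is a minimal selection-effect sketch; the failure and memory rates are arbitrary numbers I am assuming purely for the example:

```python
import random

# Toy selection-effect simulation (my own illustrative construction): the true
# rate of "things going wrong" is low, but an observer who remembers failures
# and forgets non-failures ends up with a memory dominated by failures.

random.seed(0)
true_failure_rate = 0.02       # assumed real probability that something goes wrong
p_remember_failure = 0.95      # failures are salient and almost always remembered
p_remember_success = 0.05      # things that went fine are rarely remembered

remembered = []
for _ in range(10_000):
    went_wrong = random.random() < true_failure_rate
    p_remember = p_remember_failure if went_wrong else p_remember_success
    if random.random() < p_remember:
        remembered.append(went_wrong)

perceived_rate = sum(remembered) / len(remembered)
print(f"true failure rate:      {true_failure_rate:.1%}")
print(f"perceived failure rate: {perceived_rate:.1%}")  # roughly an order of magnitude higher
```

On these made-up numbers, the remembered record looks roughly ten times more doom-laden than reality, without anyone being dishonest.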
The researchers are not the arguments. You are discussing the correctness of researchers.
Yes, that’s true, but I have more evidence than that; in particular, I have evidence that directly argues against the proposition of AI doom, and against a lot of the common arguments for AI doom.
The researchers aren’t the arguments, but the properties of the researchers looking into the arguments, especially the ways in which they’re biased, do provide some evidence about certain propositions.