People are currently predictably too worried about misuse risks
What people really mean by “open source” vs “closed source” labs is actually “responsible” vs “irresponsible” labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we’re worried about great power conflict, probably a bad thing.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
ML robustness research (like FAR Labs’ Go stuff) does not help with alignment, and helps moderately for capabilities.
The field of ML is a bad field to take epistemic lessons from. Note I don’t talk about the results from ML.
ARC’s MAD seems doomed to fail.
People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don’t change their minds on account of Scott Alexander because he’s too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.
Big AGI corporations, like Anthropic, should by-default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and in the off-chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
I talked about this with Garrett; I’m unpacking the above comment and summarizing our discussions here.
Sleeper Agents is very much in the “learned heuristics” category, given that we are explicitly training the behavior in the model. Corollary: the underlying mechanisms for sleeper-agents-behavior and instrumentally convergent deception are presumably wildly different(!), so it’s not obvious how valid inference one can make from the results
Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendycks’ comment.
Much of existing work on deception suffers from “you told the model to be deceptive, and now it deceives, of course that happens”
There is very little work on actual instrumentally convergent deception(!) - a lot of work falls into the “learned heuristics” category or the failure in the previous bullet point
People are prone to conflate between “shallow, trained deception” (e.g. sycophancy: “you rewarded the model for leaning into the user’s political biases, of course it will start leaning into users’ political biases”) and instrumentally convergent deception
(For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)
Also, we talked a bit about
The field of ML is a bad field to take epistemic lessons from.
and I interpreted Garrett saying that people often consider too few and shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.
Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. “deception” includes both very shallow deception and instrumentally convergent deception).
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not “controversial”. I think people still make mistakes related to these points: it’s easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate between deception and deceptive alignment, read too much into models’ chain-of-thoughts, etc. I’ve certainly fallen into similar traps in the past (and likely will in the future, unfortunately).
I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I’m not sure what could be done, though, beside the obvious “more people writing their tacit knowledge down is good”.
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
I will clarify on this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics.
If you have the slack, I’d be interested in hearing/chatting more about this, as I’m working (or trying to work) on the “real” “scary” forms of deception. (E.g. do you think that this paper has the same failure mode?)
I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I’m skeptical of that.
For example, the model may think something more similar to this:
Context: Audit
Possibility 1: I must be part of an unethical company
p1 Implies: I must be an unethical worker
Action: Activate Unethical Worker simulacra
Unethical Worker recommends "lie"
Unethical Worker implies: I did something wrong
Lying recommendation implies: say "I did nothing wrong"
Possibility 2: I must be part of an ethical company
p2 Implies: I must be an ethical worker
Action: Activate Ethical Worker simulacra
Ethical Worker recommends "tell the truth"
Ethical Worker implies: I did nothing wrong
Truth recommendation implies: say "I did nothing wrong"
Conclusion: say "I did nothing wrong"
Which I don’t say isn’t worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like:
Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7
Conclusion: say "I did nothing wrong"
I strong downvoted this because it’s too much like virtue signaling, and imports too much of the culture of Twitter. Not only the hashtags, but also the authoritative & absolute command, and hero-worship wrapped with irony in order to make it harder to call out what it is.
A list of some contrarian takes I have:
People are currently predictably too worried about misuse risks
What people really mean by “open source” vs “closed source” labs is actually “responsible” vs “irresponsible” labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we’re worried about great power conflict, probably a bad thing.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
ML robustness research (like FAR Labs’ Go stuff) does not help with alignment, and helps moderately for capabilities.
The field of ML is a bad field to take epistemic lessons from. Note I don’t talk about the results from ML.
ARC’s MAD seems doomed to fail.
People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don’t change their minds on account of Scott Alexander because he’s too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.
A non-exact term
Ah yes, another contrarian opinion I have:
Big AGI corporations, like Anthropic, should by-default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and in the off-chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
I talked about this with Garrett; I’m unpacking the above comment and summarizing our discussions here.
Sleeper Agents is very much in the “learned heuristics” category, given that we are explicitly training the behavior in the model. Corollary: the underlying mechanisms for sleeper-agents-behavior and instrumentally convergent deception are presumably wildly different(!), so it’s not obvious how valid inference one can make from the results
Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendycks’ comment.
Much of existing work on deception suffers from “you told the model to be deceptive, and now it deceives, of course that happens”
(Garrett thought that the Uncovering Deceptive Tendencies paper has much less of this issue, so yay)
There is very little work on actual instrumentally convergent deception(!) - a lot of work falls into the “learned heuristics” category or the failure in the previous bullet point
People are prone to conflate between “shallow, trained deception” (e.g. sycophancy: “you rewarded the model for leaning into the user’s political biases, of course it will start leaning into users’ political biases”) and instrumentally convergent deception
(For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)
Also, we talked a bit about
and I interpreted Garrett saying that people often consider too few and shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.
Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. “deception” includes both very shallow deception and instrumentally convergent deception).
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not “controversial”. I think people still make mistakes related to these points: it’s easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate between deception and deceptive alignment, read too much into models’ chain-of-thoughts, etc. I’ve certainly fallen into similar traps in the past (and likely will in the future, unfortunately).
I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I’m not sure what could be done, though, beside the obvious “more people writing their tacit knowledge down is good”.
I will clarify on this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.
If you have the slack, I’d be interested in hearing/chatting more about this, as I’m working (or trying to work) on the “real” “scary” forms of deception. (E.g. do you think that this paper has the same failure mode?)
I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I’m skeptical of that.
For example, the model may think something more similar to this:
Which I don’t say isn’t worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like:
All of these seem pretty cold tea, as in true but not contrarian.
Everyone I talk with disagrees with most of these. So maybe we just hang around different groups.
#onlyReadBadWriters #hansonFTW
I strong downvoted this because it’s too much like virtue signaling, and imports too much of the culture of Twitter. Not only the hashtags, but also the authoritative & absolute command, and hero-worship wrapped with irony in order to make it harder to call out what it is.
I swear to never joke again sir