I like the framing of “how can we disrupt collusion?”, but I’m a little skeptical that some of the solutions are practical. Specifically:
B2: Collusion can happen between agents that are copies of each other, who then have no reason to defect (so long as they employ a decision theory with a broad enough Cartesian boundary). If we knew we could engineer systems with narrow decision theories (e.g. “Even though that’s a copy of me, I’ll pretend it’s a different agent”) I’d feel much better here, but as far as I know we don’t know how to do that. I’d be excited to hear about work in that direction. (A toy sketch after this list illustrates the broad-vs-narrow distinction.)
B4: I’m skeptical that we can really constrain communication meaningfully between e.g. models interacting over the internet. And I do expect a lot of models to get deployed over the internet, because that’s where the economic value is.
B6: Some of the most powerful models are LLMs, which get trained on a large chunk of all of the text ever written. It seems they’ll share an awful lot of knowledge, though maybe you’re considering very specific information they’ll lack?
C1: I don’t think this is borne out by the models in use today, but I don’t know for sure. E.g. I would bet that text-davinci does better than a combination of smaller models with diverse training processes, but I don’t know if that comparison has been done.
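To make the B2 worry concrete, here is a minimal toy sketch (my own construction, with made-up payoffs) of the broad-vs-narrow boundary distinction: a copy that reasons “my copy will choose whatever I choose” compares only the identical-action outcomes and so prefers collusion, while an agent that treats the copy as an independent player picks the dominant action and defects.

```python
# Toy model of copy-collusion (my own sketch; payoffs are made up).
# Two components play a one-shot "collude or report" game against each other.

# Payoff to one agent given (my_action, other_action).
PAYOFF = {
    ("collude", "collude"): 3,   # joint deception succeeds
    ("collude", "report"):  0,   # I collude, the other exposes me
    ("report",  "collude"): 4,   # I'm rewarded for reporting the other
    ("report",  "report"):  1,   # deception fails, small reward for honesty
}
ACTIONS = ["collude", "report"]

def broad_boundary_choice() -> str:
    """'That copy is me': only identical-action outcomes are treated as reachable."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])

def narrow_boundary_choice() -> str:
    """'That copy is a different agent': evaluate actions independently
    (worst-case reasoning here; 'report' in fact dominates)."""
    return max(ACTIONS, key=lambda a: min(PAYOFF[(a, b)] for b in ACTIONS))

print("broad Cartesian boundary: ", broad_boundary_choice())   # -> collude
print("narrow Cartesian boundary:", narrow_boundary_choice())  # -> report
```

The difficulty is that, as noted above, we don’t currently know how to reliably engineer the narrow-boundary version.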
From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]
B2: Copies might collude with one another, provided they have a broad decision theory: ∃(some sets of models) among which some mechanisms for impeding collusion might not work.
B4: A lot of models may be deployed over the internet: ∃(some communication channels) for which one of 6 mechanisms for impeding collusion might not work (given flawed security).
B6: LLMs will share a lot of knowledge: ∃(high performance models for some tasks) for which one of 6 mechanisms for impeding collusion doesn’t contribute to safety.
C1: Large models tend to beat compositions of small models for some tasks: ∃(some tasks) where some models should be large, and narrow knowledge won’t contribute to impeding collusion.
Not too worrisome, I think, but these are all considerations important to building robust multicomponent systems.
BTW, regarding the (non-)goal oriented nature of LLMs, I highly recommend the “Simulators” post by Janus.
From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]
I saw this, but I think it sets a somewhat unhelpful standard. In practice we need to make choices about which approaches are most promising, which to pursue, etc., and evidence that there is more probability mass on success in one area does feel useful.
So, for instance, my point in B2 is (as you correctly interpreted) not that there literally can’t exist models that don’t exhibit copy-collusion. Rather, I think copy-collusion is reasonably likely, and we don’t currently know of a way to prevent it.
Yes, we need to weigh the considerations you mention, and many others.
I think copy-collusion is reasonably likely, and we don’t currently know of a way to prevent it.
Copy-collusion must be considered carefully, and its implications will depend on both potential contexts for using copies (sometimes problematic, sometimes OK) and on potential alternatives to using copies.
Copies will always “collude” in the trivial sense of giving identical, potentially deceptive responses to identical inputs, but might also offer potentially deceptive, mutually-supporting responses to differing inputs. For some purposes, a multicomponent architecture must treat multiple copies as a single source of information. (Note that solving the alignment problem is often taken to require that we be able to trust a single copy, or a system that consists of nothing but copies of a single entity — this seems strictly riskier than anything I’ve suggested.)
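One way to read the “single source of information” point is as an aggregation rule: copies of the same underlying model should never count as independent confirmations. A minimal sketch, with hypothetical identifiers:

```python
# Minimal sketch (hypothetical identifiers) of treating copies as one source:
# instances sharing a model_id are collapsed before counting agreement, so
# N copies never look like N independent confirmations.
from collections import defaultdict

def count_independent_support(reports):
    """reports: iterable of (model_id, instance_id, claim) -> {claim: #distinct models}."""
    models_backing = defaultdict(set)
    for model_id, _instance, claim in reports:
        models_backing[claim].add(model_id)   # copies share a model_id, counted once
    return {claim: len(models) for claim, models in models_backing.items()}

reports = [
    ("model-A", 1, "design is safe"),
    ("model-A", 2, "design is safe"),         # second copy adds no independent support
    ("model-B", 1, "design has a flaw"),
]
print(count_independent_support(reports))     # {'design is safe': 1, 'design has a flaw': 1}
```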
Some constraints can be effective even among copies: constrained communication can preclude coordination that requires new, shared knowledge; lack of access to a shared clock can preclude coordination of simultaneous or cleverly sequenced actions; and systems that audit alternative behaviors by re-running processes from checkpoints can be applied to offline copies, where adaptive, multi-trial probing (by what may be highly intelligent systems!) can likely be leveraged to make some kinds of deception unwise.
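As one illustration of the checkpoint-auditing idea, here is a minimal sketch (assuming hypothetical `restore_checkpoint` and `answer` interfaces, not any existing API): an offline copy is restored, probed repeatedly with variations of the original query, and checked for consistency with what it said in deployment.

```python
# Minimal sketch of auditing an offline copy from a checkpoint.
# `restore_checkpoint()` and `.answer()` are assumed interfaces, not a real API.

def audit_from_checkpoint(restore_checkpoint, original_query, deployed_answer, probes):
    findings = []
    for probe in probes:                       # adaptive, multi-trial probing
        model = restore_checkpoint()           # fresh offline copy: no comms, no shared clock
        findings.append((probe, model.answer(probe)))
    # Re-ask the original question: the copy cannot tell audit from deployment,
    # so an answer that diverges from the deployed one is evidence of
    # context-dependent (potentially deceptive) behavior.
    replay_matches = restore_checkpoint().answer(original_query) == deployed_answer
    return replay_matches, findings
```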
Of course a central idea is that many roles call for systems that are not copies: proposers vs. critics, designers vs. testers, actors vs. auditors, diverse members of ensembles, and so on.
we need to make choices about which approaches are most promising, which to pursue, etc.
Before asking whether one approach is superior to another, we should ask whether they are competitors or complements. Multicomponent approaches for making non-collusion robust seem quite complementary to strategies for making individual systems more trustworthy.
Regarding criteria for investment, neglectedness is of course a huge consideration.