I tried thinking of principles, but it was hard to find ones specific to this question. There’s one obvious ‘default’ one at least (default as in it may be overridden by the situation).
Secrecy
Premises:
Model technical knowledge progress (such as about alignment) as increasing concavely, i.e. with diminishing returns, in collaboration group size and members’ <cognitive traits>[1] (a toy numeric sketch of this shape follows the premises),
because humans are mostly the same entity[2]; naively, then, we wouldn’t expect more humans to perform significantly better[3], but...
humans do still seem to make technical progress faster when collaborating.
Combine with the unilateralist effect.
Combine with it being less hard/specific to create an unaligned than an aligned superintelligent agent (otherwise the unilateralist effect would work in the opposite direction).
This implies the positive, but not the negative, value of sharing information publicly is diminished if there is already a group trying to utilize that information. If so, the ideal may be various individual, small, or medium-sized alignment-focused groups which don’t publicly share their progress by default.[4]
(I do suspect humans are biased in favor of public and social collaboration, as that’s roughly what they were selected for, and in a less vulnerable world. Moreover, the premise that humans are ‘mostly the same entity’ does contradict aspects of humanistic ontology. That’s not strong evidence for this ‘principle’, just a reason it’s probably under-considered.)
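As a minimal numeric sketch of the shape the first premise assumes (the logarithmic form and the 0.9 ‘overlap’ figure are my own illustrative choices, not anything argued for above):

```python
import math

def progress(group_size: int, solo_rate: float = 1.0, overlap: float = 0.9) -> float:
    """Toy model: each extra collaborator adds only the (1 - overlap) fraction
    of their cognition that isn't redundant with what the group already has,
    and the pooled effort yields logarithmic (concave) returns."""
    effective_minds = 1 + (group_size - 1) * (1 - overlap)  # de-duplicated headcount
    return solo_rate * (1 + math.log(effective_minds))

for n in (1, 5, 50, 500):
    print(f"group of {n:>3}: relative progress ~ {progress(n):.2f}")
```

Under this assumed shape, a 50-person group already moves at more than half the rate of a 500-person one, which is the kind of concavity the argument needs; the first counterpoint below is precisely a doubt about whether real knowledge progress looks like this.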
Counterpoints:
On the concavity assumption:
~ In history, technical knowledge was developed in a decentralized way, IIUC (though this is based on my purely lay understanding of the history of knowledge progression, which was probably absorbed merely from stories and culture). If it’s true, it is evidence against the idea that a smaller group can make almost as much progress as a large one.
Differential progress:
~ There are already far more AI researchers than AI alignment researchers. While the ideal might be for this to be a highly secretive subject, like how existential risks are handled in Dath Ilan, this principle cannot give rise to that.
What are principles we can use when secrecy is not enough?
My first thought is to look for principles in games such as the ones you mentioned. But none feel particularly specific to this question; the search returns general things like ‘search for paths through time’, which can equally be used to pursue good or harmful ends. This is unsatisfying.
I want deeper principles, but there may be none.
Meta-principle: Symmetry: for any principle you can apply, an agent whose behavior furthers the opposite thing could in theory also apply it.
To avoid symmetry, one could look for principles that are unlikely to be usable without specific intent and knowledge. One can outsmart runaway structural processes this way, for example, and I think AI research is to a large extent such a process.
How have runaway processes been defeated before? There are some generic ways, like social movements, that are already being attempted for superintelligent-agent x-risk. Are there other, less well-known or expected ways? And did those ways reduce to generic ‘searching for paths through time’, or is there a pattern to them which could be studied and understood?
There are some clever ideas along those lines that come to mind, e.g., the “confrontation-worthy empathy” section of this post.
It’s hard for me to think of paths through time more promising than just ‘try to solve object-level alignment’, though, let alone the principles which could inspire them (e.g., I don’t know what principle the linked idea could be a case of).
I mean things like creativity, different ways of doing cognition about problems, and standard things like working memory, ‘cognitive power’, etc.
(I am using awkward constructions like ‘high cognitive power’ because standard English terms like ‘smart’ or ‘intelligent’ appear to me to function largely as status synonyms. ‘Superintelligence’ sounds to most people like ‘something above the top of the status hierarchy that went to double college’, and they don’t understand why that would be all that dangerous? Earthlings have no word and indeed no standard native concept that means ‘actually useful cognitive power’. A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)
I mean replications of the same fundamental entity, i.e., humans, or the structure of what a human is. And by ‘mostly’ I of course mean there are differences too. I think evolution implies human minds will tend to be more reflectively aware of the differences, because the sameness can operate as an unnoticed background assumption.
E.g., we wouldn’t expect asking 10 ChatGPT-3.5 instances instead of just one to do significantly better (a toy simulation of this follows the footnotes). This is less true with humans, because they were still selected to be different and to collaborate.
(and this may be close to the situation already?)
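A toy simulation of footnote 3’s intuition, under a correlated-error model I’m assuming purely for illustration (the error rates are made up): copies of one model share most of their mistakes, so majority-voting over ten of them recovers only the independent sampling noise, not the shared errors.

```python
import random

def majority_vote_accuracy(n_copies: int, shared_error: float = 0.25,
                           independent_error: float = 0.05,
                           trials: int = 100_000) -> float:
    """Each question is missed by *all* copies together with prob. shared_error
    (mistakes baked into the one underlying model), and by each copy
    independently with prob. independent_error (sampling noise)."""
    correct = 0
    for _ in range(trials):
        if random.random() < shared_error:
            continue  # every copy shares this mistake, so the vote can't fix it
        votes = sum(random.random() >= independent_error for _ in range(n_copies))
        if votes > n_copies / 2:
            correct += 1
    return correct / trials

random.seed(0)
for n in (1, 10):
    print(f"{n:>2} copies, majority vote: ~{majority_vote_accuracy(n):.3f}")
```

With these made-up numbers, ten copies lift accuracy only from roughly 0.71 to the 0.75 ceiling set by the shared errors. Humans, having been selected to differ, plausibly have a lower ‘shared error’ term, which is the sense in which collaboration still helps them.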