Let’s call the thing where you try to take actions that make everyone/yourself less dead (in expectation) the “safety game”. This game is annoyingly chaotic, kind of like Arimaa.
You write the Sequences, then some risk-averse, not-very-power-seeking nerds read them and you’re 10x less dead. Then Mr. Altman reads them and you’re 10x more dead. Then maybe (or not) there’s a backlash and the numbers change again.
You start a cute political movement, but the countermovement ends up being 10x more actionable (e/acc).
You try to figure out and explain some of the black box, but your explanation is immediately used to make a stronger black box. (Mamba, possibly.)
Etc.
I’m curious what folks use as toeholds for making decisions in such circumstances. Or, if some folks believe there are actually principles, I would like to hear them, but I suspect the fog is too thick. I’ll skip giving my own answer on this one.
I tried thinking of principles, but it was hard to find ones specific to this. There’s at least one obvious ‘default’ one (‘default’ as in it may be overridden by the situation).
Secrecy
Premises:
Model technical knowledge progress (such as about alignment) as concavely/diminishingly increasing with collaboration group size and member <cognitive traits>[1],
because humans are mostly the same entity[2], so naively we wouldn’t expect more humans to perform significantly better[3], but...
humans do still seem to make technical progress faster when collaborating.
Combine with the unilateralist effect.
Combine with it being less hard/specific to create an unaligned than an aligned superintelligent agent (otherwise the unilateralist effect would work in the opposite direction).
This implies the positive, but not the negative, value of sharing information publicly is diminished if there is already a group trying to utilize that information. If so, the ideal may be various individual, small, or medium-sized alignment-focused groups which don’t publicly share their progress by default.[4] (A toy numeric sketch of this tradeoff follows below.)
(I do suspect humans are biased in favor of public and social collaboration, as that’s kind of what they were selected for, and in a less vulnerable world. Moreover, premise 1a (‘humans are mostly the same entity’) does contradict aspects of humanistic ontology. That’s not strong evidence for this ‘principle’, just a reason it’s probably under-considered.)
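To make the shape of this argument concrete, here is a toy numeric sketch. It is entirely my own illustration, not something from the premises themselves: the √n progress curve and the 5% per-actor release probability are arbitrary stand-ins for ‘concave returns to group size’ and ‘the unilateralist effect’.

```python
import math

def progress(n: int) -> float:
    """Concave / diminishing returns to collaboration size (illustrative sqrt)."""
    return math.sqrt(n)

def p_unilateral_release(n: int, p: float = 0.05) -> float:
    """Chance that at least one of n independent actors releases dual-use info,
    if each misjudges and releases with probability p (the unilateralist effect)."""
    return 1 - (1 - p) ** n

for n in (1, 5, 25, 100):
    print(f"group size {n:>3}: progress ≈ {progress(n):5.1f}, "
          f"P(at least one unilateral release) ≈ {p_unilateral_release(n):.2f}")
```

Under these arbitrary assumptions, growing the group 100x buys only ~10x the progress, while the chance of at least one unilateral release climbs from 5% to near-certainty. That is the intuition behind preferring smaller groups that don’t share by default.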
Counterpoints:
On the concaveness assumption:
~ In history, technical knowledge was developed in a decentralized way, IIUC (though my understanding of the history of knowledge progression is purely lay, probably just absorbed from stories and culture). If that’s true, it is evidence against the idea that a smaller group can make almost as much progress as a large one.
Differential progress:
~ There are already far more AI researchers than AI alignment researchers. While the ideal might be for this to be a highly secretive subject, like how existential risks are handled in Dath Ilan, this principle alone cannot give rise to that.
What are principles we can use when secrecy is not enough?
My first thought is to look for principles in games like the ones you mentioned. But none feel particular to this question; the search just returns general things like ‘search paths through time’, which can equally be used to pursue good or harmful ends. This is unsatisfying.
I want deeper principles, but there may be none.
Meta-principle: Symmetry: for any principle you can apply, an agent whose behavior furthers the opposite thing could in theory also apply it.
To avoid symmetry, one could look for principles that are unlikely to be usable without specific intent and knowledge. One can outsmart runaway structural processes this way, for example, and I think AI research is, to a large extent, such a process.
How have runaway processes been defeated before? There are some generic ways, like social movements, that are already being attempted against superintelligent-agent x-risk. Are there other, less well-known or less expected ways? And did those ways reduce to generic ‘searching paths through time’, or is there a pattern to them which could be studied and understood?
Some clever ideas for doing something like that do come to mind, e.g., the “confrontation-worthy empathy” section of this post.
It’s hard for me to think of paths through time more promising than just ‘try to solve object-level alignment’, though, let alone the principles which could inspire them (e.g., I don’t know what principle the linked idea would be a case of).
I mean things like creativity, different ways of doing cognition about problems, and standard things like working memory, ‘cognitive power’, etc.
I mean replications of the same fundamental entity, i.e., humans, or the structure of what a human is. And by ‘mostly’ I mean that of course there are differences too. I think evolution implies human minds will tend to be more reflectively aware of the differences, because the sameness can operate as an unnoticed background assumption.
Like how we wouldn’t expect asking 10 instances of ChatGPT-3.5, instead of just one, to do significantly better. This is less true of humans, because they were still selected to be different and to collaborate.
(and this may be close to the situation already?)
I have had this same question for a while, and this is the general conclusion I’ve come to:
Identify the safety issues of today, solve them, then assume the safety issues scale as the technology scales, and either amp up the original solution or develop new tactics to solve the extrapolated flaws.
This sounds a little vague, so here is an example: we see one of the big models misrepresent history in an attempt to be woke, and maybe it gives a teenager a misconception of history. So the best thing we can do from a safety perspective is figure out how to train models to represent facts faithfully. After this is done, we can extrapolate the flaw up to a model deliberately feeding people misinformation to achieve a certain goal, and we can try to use the same solution we used for the smaller problem on the bigger problem, or, if we see it won’t work, develop a new solution.
The biggest problem with this is that it is reactive: if you only use this method, a danger may present itself for the first time and already cause major harm.
I know this approach isn’t as effective for x-risk, but still, it’s something I like to use. Easy to say, though, coming from someone who doesn’t actually work in AI safety.
This sentence has the grammatical structure of acknowledging a counterargument and negating it—“I know x, but y”—but the y is “it’s something I like to use”, which does not actually negate the x.
This is the kind of thing I suspect results from a process like: someone writes out the structure of a negation, out of wanting to negate an argument, but then finds nothing stronger to slot in where the negating argument is supposed to be.
The things you mentioned were probably all net positive; they just had some negative consequences as well. If you want to drive the far-ish future in a particular direction, you’ve just got to accept that you’ll never know for sure that you’re doing a good job.
I don’t really have a good idea of the principles here. Personally, whenever I’ve made a big difference in a person’s life (and it’s been obvious to me that I’ve done so), I try to take care of them as much as I can and make sure they’re okay.
...However, I have run into a couple of issues with this. Sometimes someone or something takes too much energy, and some distance is healthier. I don’t know how to judge this other than by intuition, but I think I’ve gone too far before?
And I have no idea how much this can scale. I think I’ve had far bigger impacts than I’ve intended, in some cases. One time I had a friend who was really in trouble and I had to go to pretty substantial lengths to get them to a better place, and I’m not sure all versions of them would’ve endorsed that, even if they do now.
...But, broadly, “do what you can to empower other people to make their own decisions, when you can, instead of trying to tell them what to do” does seem like a good principle, especially for the people who have more power in a given situation? I definitely haven’t treated this as an absolute rule, but in most cases I’m pretty careful not to stray from it.
There’s a complication where sometimes it’s very difficult to get people not to interpret things as an instruction. “Confuse them” seems to work, I guess, but it does have drawbacks too.