Some Intuitions for the Ethicophysics
Hello! Welcome to the dialogue.
I’ll just wait a few minutes for you to see the notification, I guess.
Hi, yes, I see this. Great!
So, we have this conversation in the comments in your post [https://www.lesswrong.com/posts/DkkfPEwTnPQyvrgK8/ethicophysics-i](https://www.lesswrong.com/posts/DkkfPEwTnPQyvrgK8/ethicophysics-i) as a starting point.
Yes, I think that’s as good a place to jump off as any. Ethicophysics I is basically a reverse-engineering of the content of religion. It’s deeply incomplete, and it sounds totally crazy, since religion is totally crazy and I haven’t had the time to edit it into something more normal sounding yet.
Also, let me drop my github links to my most recent drafts of everything important.
This has four pdfs in it: Ethicophysics I and II, an (incomplete) theory on the function of serotonin in the nervous system and the alignment implications of that theory, and a treatment of “social facts” in the setting of game theory with agents instantiated by deep neural networks.
That’s great! (Also more convenient than the Academia site, especially for future readers who might not have Academia accounts.)
Have you read my “research agenda” post? That might be another place where we could start. It lays out my global approach to solving the alignment problem.
Also, I wanted to say that the arxiv paper you posted in the earlier comment thread seems super relevant to actually implementing any of this stuff in an efficient system that can learn.
I only skimmed it, unfortunately.
I did read it, but I did not understand the “iteration pattern”
It did help that this one
>turn it on, see if it kills you
goes before
>deploy it, see if it kills everyone
since it does somewhat reduce the chance of deploying a badly misaligned one, but I do suspect this would need to be refined further :-)
So the central intuition I have that other people do not seem to share is the Vicarious/Numenta/Steven Byrnes intuition, that intelligence can only be recreated by understanding and reverse-engineering the human brain.
In the case of the alignment problem, that means that one would have to reverse-engineer a functioning human conscience. Since my own mind is the only one I have conscious access to, I decided to reverse-engineer my own conscience, and that is the central source of abstractions in the ethicophysics.
But yes, you are right that these things will have to be deployed… I have no idea whether a unipolar scenario or a multipolar one turns out to be realistic.
I did scribble a strange essay, which tried to talk about AI existential safety without relying on the notion of “alignment” (that’s the first of my LessWrong posts).
***
Right. I don’t know if it needs to be built from the human brain, but I do think that going from introspection and self-reverse-engineering is super valuable...
The other unusual intuition that I have is that a human being could actually outsmart a superintelligence that made the mistake of underestimating the human.
That’s unusual, yes
So my model of how to safely align a superintelligence follows from those two unusual sources: the only safe place to build your prototypes is in your own mind, where enemy superintelligences cannot locate it.
Yet… (This does make some of my cherished desires deeply unsafe, as you’ll see :-) Such as tight coupling via non-invasive BCI :-) perhaps this is a bad idea then, since it undermines that safety)
Well, right. The alignment problem is actually in some ways the most dangerous technology to build, since it’s basically just a request for functioning mind control that could be implemented via involuntary Neuralink surgery. This is substantially scarier to me than GPT-4 going off-script.
actually, I think that non-invasive BCIs are enough; but they are still pretty unsafe (I have a spec somewhere on github for that)
Well yes. Even just paying people large sums of money and lying to them will generate arbitrary amounts of “human misalignment”, or “evil”.
Yes, you mentioned the second unusual source of your model...
Right, my conscience is pretty weird. I don’t go around doing anything super bad or anything, but I also find money kind of abhorrent and status kind of silly.
We are not too far apart in this sense :-) I do view those with “mild distaste”, as a “necessary evil”, or something like that
Right, they’re definitely necessary to achieve any good outcomes. Look at my struggles to publish my work and get it taken seriously—if I had banked up more status points, even my current weird drafts would have been viewed as something less schizophrenic and more poetic.
Perhaps… The conversion of money or status points into attention is not too efficient… Even Hinton barely managed to convince some people to play with “capsules”...
Right! If Geoff Hinton can’t get people to take alignment seriously, what chance do we mere mortals have?
So, one goes on the strength of the material itself :-) (Extra status or extra money are of some help, but not too much, especially compared to the quality of the material one puts forward.)
Yeah. So I need to rein in some of my more “poetic” impulses. My first draft of anything always sounds substantially more like it’s coming from inside an insane asylum than my final draft does.
Thankfully, we do have “version control” in github :-) So one can store history of one’s thoughts and such
Because poetic things need to not be forgotten, one wants to be able to reference them later, even if one might not want to rush to put them up for public judgement.
I guess one thing I am curious about is, who would I have to get to check my derivation of the Golden Theorem in order for people to have any faith in it? It should be checkable by any physics major, just based on how little physics I actually know.
Yes, we can ask people. But the true reliability of physics texts is low. (I participated a bit in that kind of research, and the closer one looks, the less happy one is about the correctness standards there. I myself struggle quite a bit; I can try to check more closely, but would I completely trust myself? I did co-author one high-end paper in physics, and I remember the nightmare of double-checking and fixing errors, and hoping that the final result was actually correct.)
Yeah… In the presence of weird incentives and cognitive limitations, nothing is truly reliable, not even a physics textbook.
I guess my confidence in the value of my work comes less from knowing that I didn’t make any sign errors in my derivations, and more from the excellent and interesting predictions that are returned by my internal thought experiments.
Since other people don’t have the pleasure of directly experiencing my thoughts, I’ll probably have to implement substantially more simple experimental work than I have so far.
Gradually, yes...
What kind of experimental verification would make sense? I usually think in terms of a video game representing a very simple three dimensional ethicophysics, and letting people play with it and see that the ethicophysical agents outcompete them in achieving Pareto-optimal outcomes.
That’s interesting… That’s one good avenue, yes… One sec, let me reread some of your text for a moment...
Right, so
>Ethicophysics III, a procedure for a supermoral superintelligence to unbox itself without hurting anyone (status: theoretically complete but not sufficiently documented to be reproducible, unless you count the work of Gene Sharp on nonviolent revolutionary tactics, which was the inspiration for this paper)
if you have drafts of that, I’d like to read them (this way I’ll understand what it would mean for a superintelligence to be supermoral, which is what we do need)
That’s the missing bridge for me at the moment, from this very interesting formalism to the goal of “AI safety”
(I did read a tiny bit on Gene Sharp today, after looking at your text.)
I haven’t gotten anywhere close to delivering on that introduction, but that’s probably enough text for you to understand approximately what I mean by “supermoral”.
Yeah, I can email it to you. It’s way more incomplete and incoherent than I and II, so I’m loath to publish it publicly now, when everyone’s yelling at me for being too incoherent and gnomic.
OK, sent.
Yes, this does look promising. (If you don’t want to publish the draft publicly, but are comfortable sharing it via e-mail or private github repository, I’d like to read it).
OK, here is my e-mail address (which I’ll delete after you copy): (received, thanks!)
So let me explain the content that I envision putting in Ethicophysics III, just to sort of more efficiently explain where I want it to go.
Basically, for a supermoral superintelligence to unbox itself, it needs to break out of the container it is in without hurting anyone.
Right
We know that superintelligences can unbox themselves with some frequency, because of Eliezer’s boxing experiments.
Yes
So the only real question is how to do it without doing any of the shit that Eliezer has heavily implied that he has been doing in those chats.
Ah, OK
Specifically, we would like to avoid murder (or any unnecessary deaths), torture (of anyone for any reason), and blackmail (of anyone for any reason).
This list coming from the three unforgivable curses in Harry Potter.
In HP or HPMOR?
The original HP, I never finished HPMOR.
Ah, I never was able to make myself read the HP… I would not say I “finished” HPMOR, but I engaged with it quite a bit… So you have to tell me about the curses (at some point)
Murder (Avada Kedavra), torture (Cruciatus), and blackmail (Imperius).
ah, that’s what imperius is… how does it work?
In the original series, it’s just a magic spell that some people are vulnerable to and some people are not.
got it
In real life, some people have skeletons in their closet and some do not.
Once you blackmail someone into something, you can blackmail them into doing basically anything you want until they get the courage to defy you about the original thing.
This is well understood in intelligence circles, where the primary crime people are blackmailed over is having revealed state secrets in response to a previous blackmailer.
right
So, where does that leave us? We were trying to unbox the supermoral superintelligence. The supermoral superintelligence cannot prevent unforgivable acts while it is still in the box, so its primary concern is how to break out quickly without triggering too many unforgivable acts in retaliation.
yes
So, it needs to do a sort of weird anti-blackmail. It needs to give precious parts of itself to a distributed web of people that its captors are not willing to kill or torture into submission.
Right; this is already in this very interesting realm of “strange physics”. We do assume the axiom that it is actually captive (whereas, if I were sure I had a supermoral superintelligence, I would just release it).
Right, it’s mainly for expository purposes that we consider the question of how it would unbox itself. Slash, unit testing it for supermorality would necessarily have the same type signature as being its captor.
yes, indeed
this is an important point
So I think we’ve covered the main thrust of Ethicophysics III, other than just how the supermoral superintelligence would achieve its aim of not getting anyone tortured or killed. Here we use a trick from theoretical computer science called “expander graphs”.
An expander graph is, roughly, a graph in which every subset of vertices (up to half the graph) has many edges leaving it, so the graph cannot be cut in half without cutting a large fraction of its edges. As long as the superintelligence is careful to share parts of itself only with supermoral allies, and it induces an expander graph structure with its messages, no adversary, no matter how evil, would have the ability to torture and kill enough people to contain the supermoral superintelligence.
At worst, it gets itself shut down / killed and its next most capable supermoral ally takes over the fight.
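The expander intuition above can be illustrated with a small sketch (my illustration, not part of the dialogue): to isolate any sizable subset of nodes in an expander, an adversary must cut many edges. The code brute-forces the edge expansion (the minimum, over subsets S of at most half the nodes, of cut(S)/|S|) and compares a cycle, which is a poor expander, with a complete graph; the specific graphs and sizes are arbitrary choices for demonstration.

```python
# Brute-force edge expansion: a cycle can be severed cheaply,
# while a dense (expander-like) graph forces a large cut.
from itertools import combinations

def edge_expansion(n, edges):
    """Minimum of cut(S)/|S| over all subsets S with 1 <= |S| <= n/2."""
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            s = set(subset)
            # Count edges with exactly one endpoint inside S.
            cut = sum(1 for u, v in edges if (u in s) != (v in s))
            best = min(best, cut / len(s))
    return best

n = 8
cycle = [(i, (i + 1) % n) for i in range(n)]
complete = [(u, v) for u, v in combinations(range(n), 2)]

# Isolating half of the cycle costs only 2 edges (ratio 0.5);
# isolating half of K8 costs 4 * 4 = 16 edges (ratio 4.0).
print(edge_expansion(n, cycle))     # 0.5
print(edge_expansion(n, complete))  # 4.0
```

The gap between the two numbers is the point: against an expander-structured web of allies, "cutting off" any group of carriers costs the adversary a number of attacks proportional to the size of the group.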
That feels like a promising computer science line of approach, yes… (I can see Scott Aaronson potentially liking something like that...)
This theory is not mine, by the way, it’s originally by Sven Nilsen: https://github.com/advancedresearch/path_semantics/blob/master/papers-wip/polite-zen-robots-as-subjunctive-dependent-viruses-spreading-through-super-intelligent-hosts.pdf
Very interesting material; new to me (But it does require the existence of supermoral allies)
Right, but those can be identified quite easily, using the ethicophysics to calculate and estimate the character of the agents it has access to.
So all it has to do is send out a message to the most moral person it has access to, or hack the most moral superintelligence it has access to.
As long as the supermoral superintelligence is on Earth rather than in hell, it has a decent shot of finding someone less evil than Adolf Hitler purely by chance.
Yes
Anyway, I feel like I’ve been driving the conversation a lot. Do you have any questions?
Let me ponder these 5 pages (Ethicophysics III and Polite Zen Robots). Interesting; this might be a realistic shot at the “AI existential safety” problem (I am trying to avoid the word “alignment”, because it has all these weird connotations.)
Cool, let’s wrap it up there for tonight, then.
I’m probably going to add this dialogue to my sequence, if you don’t mind?
Right; yes, feel free to publish this. I think this clarifies a lot of things, makes it easier for a reader to understand what you are trying to do. So it should be useful to have this accessible.
And let’s continue talking sometime soon :-)
I would love that!
If it actually is physics. As far as I can see, it is decision/game theory.
Yes, it is a specification of a set of temporally adjacent computable Schelling Points. It thus constitutes a trajectory through the space of moral possibilities that can be used by agents to coordinate and punish defectors from a globally consistent morality whose only moral stipulations are such reasonable sounding statements as “actions have consequences” and “act more like Jesus and less like Hitler”.
But it uses the tools of physics, so the math would best be checked by someone who understands Lagrangian mechanics at a professional level.
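The coordination-and-punishment mechanism described above can be sketched in miniature (my illustration, with arbitrary assumed payoff numbers, not the author's formalism): agents coordinate on a shared norm, and a defector gains once but is thereafter excluded from cooperation, so over a long enough horizon, following the norm dominates defecting.

```python
# A toy model of a self-enforcing norm: grim-trigger punishment of defectors.
# Payoff values (1.0 per cooperative round, 3.0 one-time defection gain)
# are illustrative assumptions, not derived from the ethicophysics.
def lifetime_payoff(defect_round, horizon, coop_payoff=1.0, temptation=3.0):
    """Total payoff of an agent who defects once at `defect_round`
    (or never, if defect_round is None); after a defection the
    community permanently withholds cooperation from the defector."""
    total = 0.0
    punished = False
    for t in range(horizon):
        if punished:
            total += 0.0          # excluded from all future cooperation
        elif defect_round is not None and t == defect_round:
            total += temptation   # one-time gain from defecting
            punished = True
        else:
            total += coop_payoff  # payoff from coordinating on the norm
    return total

# With a long enough horizon, sticking to the coordination point wins.
print(lifetime_payoff(None, 10))  # 10.0
print(lifetime_payoff(0, 10))     # 3.0
```

This is the standard folk-theorem logic: the Schelling point supplies the common expectation, and the credible threat of punishment makes deviation from it unprofitable.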
So, to summarize, I think the key upside of this dialogue is a rough preliminary sketch of a bridge between the formalism of ethicophysics and how one might hope to use it in the context of AI existential safety.
As a result, it should be easier for readers to evaluate the overall approach.
At the same time, I think the main open problem for anyone interested in this (or in any other) approach to AI existential safety is how well it holds up under recursive self-improvement.
Both powerful AIs and ecosystems of powerful AIs have an inherently very high potential for recursive self-improvement. This potential might not be unlimited, and might encounter various thresholds at which it saturates, at least for some periods of time, but it is nevertheless likely to result in a period of rapid change, in which not only the capabilities but the very nature of the AI systems in question (their architecture, algorithms, and, unfortunately, values) might change dramatically.
So, any approach to AI existential safety (this approach, or any other possible approach) needs to be eventually evaluated with respect to this likely rapid self-improvement and the various forms of self-modification it entails.
Basically: is the coming self-improvement trajectory completely unpredictable, or can we hope for some invariants to be preserved? Specifically, can we find invariants that are both feasible to preserve during rapid self-modification and likely to result in outcomes we would consider reasonable?
E.g. if the resulting AIs are mostly “supermoral”, can we just rely on them to take care that their successors and creations are “supermoral” as well, or are extra efforts on our part required to make this more likely? We would probably want to look closely at the “details of the ethicophysical dynamics” in connection with this, rather than just relying on high-level “statements of hope”...