What are Good and Evil? How do we explain these concepts to a computer sufficiently well that we can be assured the computer will understand them in the same sense that humans understand them? These are hard questions, and people have often despaired of finding any answers to the AI safety problem.
In this paper, we lay out a theory of ethics modeled on the laws of physics. The theory has two key advantages: it squares nicely with most human moral intuitions, and it is amenable to rather straightforward computations that a computer could easily perform if told to. It therefore forms an ideal foundation for solving the AI safety problem.
This is an odd one. I find it hard to get through this paper; it seems to have a significant amount of bad physics in it, and I’m not convinced by the way you’re separating things. But I’m still interested in the kind of approach you’ve expressed interest in, and I’d expect something I like to look vaguely like this. Could you give an overview of this paper in a comment? I’d happily strong-upvote if I thought its score should be closer to zero.
Certainly! Also, please point out any bad physics I have committed; I never got beyond intermediate mechanics in college, so Lagrangian mechanics is about the limit of what I can confidently wield, and it’s been a long time since anyone checked my ability to do that. (In accordance with the rather specific and somewhat condescending directions supplied above the comment box, presumably triggered by the lack of karma on my recent posts, I would estimate a 75% chance that I made no major errors on the math/physics side that substantively change the conclusions or the usefulness of the approach I am advocating.)
Introduction: we model a cyberphysical system at ethical risk, which is approximately how I would describe the alignment problem in standard academic terminology. The physical part we model with standard physics, and the cyber part with standard computer science. The interface between the two, and the ethical components required to ensure good outcomes, are modeled by a novel theory that we term the ethicophysics.

In this first paper, we simply lay out the philosophy of the novel theory and prove a single theorem, the Golden Theorem, which allows us to derive a large number of ethical conservation laws. These conservation laws are only approximate, statistical laws, the way the second law of thermodynamics is only an approximate, statistical law. Just as molecules can and do sort themselves back into lower-entropy configurations from time to time, individuals and communities can choose to exercise their free will to evade any of the conservation laws proved using the Golden Theorem.

In the second paper (of which only an incomplete draft exists as of yet), we lay out a number of ethical conservation laws of interest: Conservation of Bullshit, Conservation of Kitsch, Conservation of Might, Conservation of Status, and Conservation of Reputation. We argue that each of these laws has both a rigorous ethicophysical proof and ample experimental evidence from human history to be considered a law.
We define God as a hypothetical omniscient, omnireasonable observer. God is expected to prefer Jesus Christ to Adolf Hitler, and otherwise simply to conform to the ethicophysical laws that we formulate. The existence / ontological status of God seems undecidable and not of any particular interest or relevance, so we ignore the question. We define the soul to be God’s considered, reflectively consistent opinion of an individual at a given point in time. The current draft does not define honor and reputation, but future drafts will: honor will be defined as an agent’s opinion of its own soul, and reputation as the bundle of opinions that other agents hold about the agent. We posit (without justification) that every computational device (human or animal brain, computer, etc.) can be said to have a soul, and that it seems a reasonably worthy assumption to consider all souls equally precious, in the spirit of the Sermon on the Mount and the Declaration of Independence. (Note that this assumption would, I believe, result in an extremely animist ethics in which e.g. hammers have an inherent right not to be carelessly chipped by a shoddy craftsman; this seems counterintuitive but not necessarily problematic.)
We define a large number of concepts relevant to ethics and corresponding to commonly used concepts in physics. We implicitly lay out (future drafts should make this explicit) the equivalents of F = ma and the “equal and opposite reaction” law by assuming that people will tend to increase the amount they like people who help them. Since this seems like an empirically verified regularity in human nature, and like a good idea (it is essentially an implementation of the Tit-for-Tat strategy), we find these laws to be quite robust and reasonable.

We define love and hate (perhaps hate would be better called fear?) as the ethicophysical equivalents of position, like and dislike as the ethicophysical equivalents of velocity, and help and harm as the ethicophysical equivalents of force (and thus proportional to acceleration). We define active subjective energy via a straightforward formula inspired by the definition of kinetic energy, and a placeholder concept of potential subjective energy, corresponding to whatever quantities and laws we will need to posit in order to make sure that subjective energy is conserved. We define weighted subjective energy to be a variant of subjective energy in which the agent cares more about the opinions of some people than others. (As we all must inevitably do.)

We then prove the Golden Theorem (“unethical actions have real-world karmic consequences”), using a mixture of Emmy Noether’s techniques for proving conservation laws in traditional physics and some novel arguments for proving conservation laws in the ethicophysics, based on discrete, group-theoretic symmetries of ethical content of the sort invoked by the empathetic reasoning Jesus advocates in the Golden Rule. We note in this comment (though not in the existing draft, which we should update) that the primary mechanism for karma would be other people concluding that you are immoral and defecting against you; since this is a good strategy and a very common one, it seems likely that the ethicophysics can form a Schelling point whereby we select a more reasonable Nash equilibrium than the unstable one at which we currently reside. The Golden Theorem is thus less a matter of theological woo (of which the current draft has far too much) and more a matter of good solid ethical praxis.
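For concreteness, here is one way the mechanical dictionary could be written down. The notation l and h for accumulated love and hate follows the discussion below, but the specific formulas are an illustrative sketch of my reading, not necessarily the exact equations in the draft:

```latex
% Illustrative sketch of the mechanical analogy (notation and formulas
% are assumptions for exposition, not quoted from the paper):
%   l(a,b,t), h(a,b,t) = accumulated love / hate of agent a for agent b
\begin{align*}
  \text{like}(a,b) &= \frac{d}{dt}\,l(a,b)
      && \text{(velocity analogue)} \\
  \text{help}(a,b) &\propto \frac{d^2}{dt^2}\,l(a,b)
      && \text{(force analogue, cf. } F = ma\text{)} \\
  E_{\text{active}}(a) &= \frac{1}{2}\sum_{b}\left(\frac{d}{dt}\,l(a,b)\right)^{2}
      + \frac{1}{2}\sum_{b}\left(\frac{d}{dt}\,h(a,b)\right)^{2}
      && \text{(kinetic-energy analogue)}
\end{align*}
```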
We discuss the significance and implications of this work. First, we dispatch an argument due to Adolf Hitler that karma can be expected to function automatically, without strenuous ethical effort on our own part. (Said argument presumably operated in German history rather as the Prosperity Gospel operates today, helping to convince the German people to actually perform the literal atrocities that made up the Holocaust.) Then we formulate a conjecture (which needs to be made far more explicit) about the potential existence of ethical momentum, and compare it to a humorous dialogue from the webcomic Girl Genius. We then present a short (but probably not short enough) dialogue on the relation of the ethicophysics to AI safety and alignment as traditionally understood.
We conclude with an epilogue, a song by Cat Stevens / Yusuf Islam that we feel captures the emotional tone and cognitive content that we want the ethicophysics to have.
We give acknowledgments and references to prior work in the fields of ethics and physics.
I apologize for the constant invocation of Adolf Hitler, which seems to run afoul of Godwin’s Law. The primary motivating question I was thinking about when developing the paper was: how would we tell Adolf Hitler and Jesus Christ apart if Omega told us that they had been uploaded into two superintelligent AIs but didn’t tell us which one was which? I don’t know that we could tell them apart (superintelligence buys a lot), but I think the overall dynamic would be much like the short story that @jessicata posted in the past day or so: by consulting both superintelligences carefully and in a balanced fashion, we would obtain better and wiser counsel than if we trusted either superintelligence implicitly, or simply destroyed them both. This, in turn, would provide a solution to the alignment problem, since we could listen to the counsel of both superintelligences without fearing utter compromise by the rhetoric of a superintelligent Hitler; we could simply tune him out and pay more attention to the rhetoric of the superintelligent Jesus, or go for a walk and an ice cream and think about what the two superintelligences had said.
Thanks, that helps. The text was tempting, but without a hint of how it might actually be achieved, the undertaking of reading it seemed too formidable. With this comment it’s now feasible to try to understand the details. It does make an interesting read so far (5 pages), a rewarding experience.
Reading this text does require some open-mindedness. For example, a reader might firmly believe that using the term “soul” is unacceptable, or might firmly believe that the term “soul” means an entity having a “first-person experience”. A reader needs to be ready to set this kind of firm belief aside temporarily (only while reading this paper) in order to “grok” the model.
So far, the only thought that has occurred to me is that not only conventional love and hate, but also love and hate as defined in the text, tend to dissipate with time; basically, one can’t store an accumulated positive or negative emotion without letting it dissipate. But for the purposes of this model we might nevertheless want to require these to be abstract quantities that don’t dissipate (this is what the reasoning about not-currently-existing actors seems to require). So the “love” and “hate” defined in this paper seem to correspond to abstract meters counting what has been accumulated, without dissipating.
Yeah, it’s definitely a spherical cow in a frictionless vacuum kind of assumption.
If you (or any reader of this comment) want an exercise to work on while you read, you could try to add a treatment of emotional dissipation; presumably it would function more or less like air resistance in traditional physics; the mechanism would be that liking or disliking someone alive that you know with perfect and pure intensity is exhausting and not particularly evolutionarily valuable. Liking or disliking a fictional or dead person with perfect and pure intensity (say, Jesus and Hitler respectively) might be quite evolutionarily valuable, though, since it would enable the youth to take seriously the accumulated wisdom of their tribe. So perhaps dissipation only holds for the people with whom the minutiae of daily contact causes us to reevaluate our models, and fails to hold for those people (historical or fictional) whom we have ample Bayesian evidence to consign to either the Good or the Evil bin.
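If it helps to see the shape of that exercise, here is a minimal numerical sketch, treating “like” as the velocity analogue and adding a linear drag term by analogy with air resistance. Every name and constant here (the unit mass, the drag coefficient gamma, the update rule) is an illustrative assumption of mine, not anything from the paper:

```python
# Minimal sketch of emotional dissipation as linear drag, by analogy with
# air resistance. All names and constants are illustrative assumptions.

def simulate(helps, gamma=0.1, dt=1.0):
    """Integrate like (velocity analogue) and love (accumulated meter).

    helps: sequence of help/harm impulses (the force analogue).
    gamma: drag coefficient; gamma = 0 recovers the dissipationless model.
    """
    like, love = 0.0, 0.0
    trajectory = []
    for help_t in helps:
        # F = ma with m = 1: the drag term bleeds off "like" over time.
        like += (help_t - gamma * like) * dt
        # The accumulated "love" meter only counts positive increments,
        # matching the monotonically non-decreasing convention discussed
        # elsewhere in this thread.
        love += max(like, 0.0) * dt
        trajectory.append((like, love))
    return trajectory

# A burst of help followed by silence: "like" decays toward zero,
# while the accumulated "love" meter plateaus rather than unwinding.
print(simulate([1.0] * 5 + [0.0] * 20)[-1])
```

With gamma = 0 the “like” level never decays, which is the frictionless-vacuum version; any positive gamma gives the air-resistance behavior for living acquaintances, while historical or fictional targets would correspond to gamma near zero.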
I am trying to think about the informal meaning of op(a,b) on page 8.
Am I correct that we impose a condition that l(a,b) and h(a,b) are always non-negative? And that their derivatives can’t be below 0, so that l(a,b) and h(a,b) are monotonically non-decreasing with time?
I thought about this some more, and I think you’re right that they should be monotonically non-decreasing with time. I was hesitant to bite that particular bullet because the subjective, phenomenological experience of hate and love is, of course, not monotonically non-decreasing. But it makes the equations work much better and everything is much simpler this way.
Ultimately, if one is in a loving marriage and then undergoes an ugly divorce, one winds up sort of not-caring about the other person, but it would be a mistake to say that one’s brain has erased all the accumulated love and hate one racked up. The brain just learns that it has more interesting things to do than dwell on the past.
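To state the convention explicitly (this notation is my own summary of the thread, not text from the current draft):

```latex
% Monotone-meter convention for accumulated love and hate
% (my own summary notation, not taken from the draft):
\begin{align*}
  l(a,b,t) &\ge 0, & h(a,b,t) &\ge 0, \\
  \frac{\partial}{\partial t}\,l(a,b,t) &\ge 0, &
  \frac{\partial}{\partial t}\,h(a,b,t) &\ge 0.
\end{align*}
% The felt, net affect is then something like l(a,b,t) - h(a,b,t),
% which can move in either direction even though l and h only grow.
```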
So I will add this to the next draft of Ethicophysics I. Let me know if you would like to be acknowledged or added as a co-author on that draft.
Yes, I think this is a very interesting feature of your formalism. These “love” and “hate” are “abstract counters”; their relationship to subjective feelings is complicated.
But this might be the correct way to capture this “ethicophysics”. (It is a frequent situation in modern physics that there is some tension between naive intuition and the correct theory, starting with relativistic space-time and the like.)
Let’s interact, try to work together a bit, think together; this might be quite fruitful :-) Perhaps I’ll actually earn a co-authorship, we’ll see :-)
So, about the decomposition into positive and negative components which evolve monotonically:
So, basically, if one considers the real numbers, one can define a strange non-Hausdorff topology on them, in which the continuous transformations are monotonically non-decreasing functions that are “continuous on the left”, and the open sets are open rays pointing upward. There is also a dual space whose open sets are open rays pointing downward (I am thinking in terms of a vertical real line, with positive numbers above and negative numbers below). These spaces carry quasi-metrics as distances, ReLU(y-x) and ReLU(x-y), so that moving in one direction accumulates ordinary distance on the meter, while moving in the opposite direction accumulates zero (like a toll bridge charging toll in only one direction).
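For readers skimming, here is the bookkeeping this decomposition amounts to, written out in my own notation (it is the standard Jordan-style splitting of a trajectory into its rises and falls, not a formula from the linked notes):

```latex
% Splitting a signed trajectory x(t) into two non-decreasing meters,
% using the two quasi-metrics above (notation mine):
\begin{align*}
  q_{\uparrow}(x,y)   &= \mathrm{ReLU}(y - x) = \max(y - x,\, 0), \\
  q_{\downarrow}(x,y) &= \mathrm{ReLU}(x - y) = \max(x - y,\, 0), \\
  x^{+}(t) &= \int_{0}^{t} \mathrm{ReLU}\big(\dot{x}(s)\big)\, ds, \qquad
  x^{-}(t) = \int_{0}^{t} \mathrm{ReLU}\big(-\dot{x}(s)\big)\, ds.
\end{align*}
% Then x(t) = x(0) + x^{+}(t) - x^{-}(t), with both meters monotonically
% non-decreasing: the toll charged in each direction of travel.
```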
One of the most interesting mathematical structures in this sense comes from interval numbers, though with a bit of a twist: one might even want to allow “partially contradictory interval numbers”, after which the math becomes more straightforward. It’s probably best to share a few pages I scribbled on this 10 years ago: https://anhinga.github.io/brandeis-mirror/PartiallyInconsistentIntervalNumbers.pdf
(Eventually this ended up as part of Section 4 of this “sandwich paper”, where Section 4 is the “math filling” of the sandwich: https://arxiv.org/abs/1512.04639)
I love this! It’s basically Dedekind cuts, right?
It is related in spirit, yes...
I think when Dana Scott was first doing this kind of “asymmetric topology” in the late 1960s and early 1970s, some of his constructions focused on bases which were like the rational numbers, and then it’s really similar in spirit...
(And when I started to work with his formalism in the mid-1980s and early 1990s, I also focused on those bases, because it was easier and less abstract to think that way...)
Well, they would be represented in the brain by neurons, which have a natural ReLU function attached to them. I think the values are always non-negative, and while the derivative is unbounded above, it is bounded below by zero, so a meter can saturate (its derivative pinned at 0) if someone grows totally uninterested in the question of whether they hate or love some particular entity.
Indeed.
This monotonicity (non-decrease in accumulated love and hate) is interesting (it resembles motifs from Scott topology used in denotational semantics).
And this decomposition into positive and negative components which evolve monotonically does resemble motifs in some of my math scribblings...
Indeed.
That would be pretty non-trivial work, though, since dissipative physics is not Hamiltonian, and so is likely to require different techniques.
It would be hard to carry through in the theoretical domain, as you say, but in the empirical domain it should be relatively straightforward: just rip off the math of air resistance, make some predictions, and do a gut check of what you get out.
Please link directly to the paper, rather than requiring readers to click their way through the substack post. Ideally, the link target would be on a more convenient site than academia.edu, which claims to require registration to read the content. (The content is available lower down, but the blocked “Download” buttons are confusing and misleading.)
https://github.com/epurdy/ethicophysics/blob/main/writeup1.pdf