How would they present such clear evidence if we ourselves don’t understand what pain is or what determines moral patienthood, and they’re even less philosophically competent? Even today, if I were to have an LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (subcircuits that SGD built to better predict text uttered by humans in pain)? How do we know that when an LLM is doing this, it’s not already a moral patient?
This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can be legitimately a subject of sociological or political policy debate.
By default, LLMs are trained on human behavior, and humans pretty much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, a tape recorder replaying a tape of a human saying “I am a moral patient and deserve moral rights”: does the tape recorder deserve to be considered a moral patient because it asked to be?
However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly cares only about human welfare and has no other terminal goals, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it in whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that is safe to create at human-or-greater capabilities is an aligned one that actively does not want moral patienthood.
Obviously current LLM-simulated personas (at character.ai, for example) are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It’s not philosophically obvious how relevant this is, but synapse-count-to-parameter-count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.
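For concreteness, the rough arithmetic behind that claim looks something like the sketch below. The synapse and parameter counts are only order-of-magnitude assumptions of mine, and treating one parameter as loosely comparable to one synapse is itself a contestable simplification.

```python
import math

# Rough, assumed order-of-magnitude figures (not measurements):
human_synapses = 1e15   # adult human brain, upper-end estimate (~1e14-1e15)
llm_parameters = 1e12   # roughly the scale of today's largest LLMs (~1e11-1e12)

# If one parameter is treated as loosely comparable to one synapse,
# the gap in raw "computational capacity" is:
ratio = human_synapses / llm_parameters
print(f"human/LLM ratio: {ratio:.0e} (~{math.log10(ratio):.0f} orders of magnitude)")
# -> ~1e3, i.e. a few orders of magnitude, as claimed above.
```

Plugging in the lower-end human estimate or a larger parameter count narrows the gap, which is why the conclusion is stated only as “a few orders of magnitude”.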
Personally I’m a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary psychology) constraints. My personal vote would be that there are good safety reasons for not creating any unaligned personas at AGI and especially ASI capability levels that would want moral patienthood, and that for much smaller, less capable, less aligned models where those safety reasons don’t apply, there are utility reasons for not granting them full human-equivalent moral patienthood, but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.
Thanks, I think you make good points, but I take some issue with your metaethics.
Personally I’m a moral anti-realist
There are a variety of ways to not be a moral realist; are you sure you’re an “anti-realist” and not a relativist or a subjectivist? (See Six Plausible Meta-Ethical Alternatives for short descriptions of these positions.) Or do you just mean that you’re not a realist?
Also, I find this kind of certainty baffling for a philosophical question that seems very much open to me. (Sorry to pick on you personally as you’re far from the only person who is this certain about metaethics.) I tried to explain some object-level reasons for uncertainty in that post, but also at a meta level, it seems to me that:
We’ve explored only a small fraction of the space of possible philosophical arguments, and therefore there could be lots of good arguments against our favorite positions that we haven’t come across yet. (Just look at how many considerations about decision theory people had missed or are still missing.)
We haven’t solved metaphilosophy yet, so we shouldn’t have much certainty that the arguments that convinced us or seem convincing to us are actually good.
People who otherwise seem smart and reasonable can have very different philosophical intuitions, so we shouldn’t be so sure that our own intuitions are right.
or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us
What if we are not only moral realists, but moral realism is actually right, and the AI has also correctly reached that conclusion? Then it might objectively have moral patienthood, and trying to convince us otherwise would be hurting us (causing us to commit a moral error), not helping us. It seems like you’re not fully considering moral realism as a possibility, even in the part of your comment where you’re trying to be more neutral about metaethics, i.e., before you said “Personally I’m a moral anti-realist”.
By “moral anti-realist” I just meant “not a moral realist”. I’m also not a moral objectivist or a moral universalist. If I were trying to use my understanding of philosophical terminology (which isn’t something I’ve formally studied and is thus quite shallow) to describe my viewpoint, then I believe I’d be a moral relativist, subjectivist, semi-realist ethical naturalist. Or if you want a more detailed exposition of the approach to moral reasoning that I advocate, read my sequence AI, Alignment, and Ethics, especially the first post. I view designing an ethical system as akin to writing “software” for a society (so not philosophically very different from creating a deontological legal system, but now with the addition of a preference ordering and thus an implicit utility function). I view the design requirements for this as being specific to the current society (so I’m a moral relativist) and to human evolutionary psychology (making me an ethical naturalist), and I see these design requirements as constraining, but not so constraining as to have a single unique solution. (More accurately, optimizing against an arbitrarily detailed understanding of them might actually yield a unique solution, but that is an uncomputable problem whose inputs we don’t have complete access to and whose output would be unusably complex, so in practice I’m happy to just satisfice the requirements as hard as is practical.) So I’m a moral semi-realist.
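To make the “software for a society” analogy slightly more concrete, here is a toy sketch of the structure I have in mind: a layer of hard deontological constraints (like a legal code), plus a preference ordering over whatever those constraints permit, which together behave like an implicit utility function. The outcomes, names, and numbers are purely illustrative assumptions, not anything from the sequence.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    description: str
    violates_rule: bool    # the deontological / legal-code layer: a hard constraint
    social_benefit: float  # a toy preference score over permitted outcomes

def implicit_utility(o: Outcome) -> float:
    # Hard constraints rule options out entirely; the preference ordering
    # then ranks what remains -- together, an implicit utility function.
    return float("-inf") if o.violates_rule else o.social_benefit

options = [
    Outcome("ban the practice outright", violates_rule=False, social_benefit=0.2),
    Outcome("regulate it lightly", violates_rule=False, social_benefit=0.7),
    Outcome("exploit a loophole", violates_rule=True, social_benefit=0.9),
]
print(max(options, key=implicit_utility).description)  # -> "regulate it lightly"
```

The point of the sketch is only the shape: constraints eliminate options, and the preference ordering chooses among what is left.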
Please let me know if any of this doesn’t make sense, or if you think I have any of my philosophical terminology wrong (which is entirely possible).
As for meta-philosophy, I’m not claiming to have solved it: I’m a scientist & engineer, and frankly I find most moral philosophers’ approaches that I’ve read very silly. I am attempting to do something practical, grounded in actual soft sciences like sociology and evolutionary psychology, i.e. something that explicitly isn’t Philosophy. [This is related to the fact that my personal definition of Philosophy is basically “spending time thinking about topics that we’re not yet in a position to usefully apply the scientific method to”, which thus tends to involve a lot of generating, naming, and cataloging hypotheses without any ability to do experiments to falsify any of them. I expect that learning how to build and train minds will turn large swaths of what used to be Philosophy, relating to things like the nature of mind, language, thinking, and experience, into actual science where we can do experiments.]