evhub comments on Introducing Alignment Stress-Testing at Anthropic

evhub 19 Jan 2024 23:30 UTC
LW: 2 AF: 2
0
AF

Aside from lock-in, what about value drift/corruption, for example of the type I described here?

Yeah, I am pretty concerned about persuasion-style risks where AIs could manipulate us or our values.

What about near-term serious moral errors, for example running SGD on AIs that actually constitute moral patients, which ends up constituting major harm?

I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.

At what point do you think AI philosophical competence will be important?

The obvious answer is the point at which much/most of the decision-relevant philosophical work is being done by AIs rather than humans. Probably this is some time around when most of the AI development shifts over to being done by AIs rather than humans, but you could imagine a situation where we still use humans for all the philosophical parts because we have a strong comparative advantage there.
- Wei Dai 20 Jan 2024 6:49 UTC
  LW: 2 AF: 2
  0
  AF Parent
  
  Yeah, I am pretty concerned about persuasion-style risks where AIs could manipulate us or our values.
  
  Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.
  
  I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.
  
  How would they present such clear evidence if we ourselves don’t understand what pain is or what determines moral patienthood, and they’re even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it’s not already a moral patient?
  
  Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that’s not actually true?
  
  you could imagine a situation where we still use humans for all the philosophical parts because we have a strong comparative advantage there.
  
  Wouldn’t that be a disastrous situation, where AI progress and tech progress in general are proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking? Would love to understand better why you see this as a realistic possibility, but do not seem very worried about it as a risk.
  
  More generally, I’m worried about any kind of differential deceleration of philosophical progress relative to technological progress (e.g., AIs have taken over philosophical research from humans but are worse at it than technological research), because I think we’re already in a “wisdom deficit” where we lack philosophical knowledge to make good decisions about new technologies.
  - RogerDearnaley 20 Jan 2024 22:08 UTC
    LW: 3 AF: 3
    0
    AF Parent
    How would they present such clear evidence if we ourselves don’t understand what pain is or what determines moral patienthood, and they’re even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it’s not already a moral patient?
    This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can be legitimately a subject of sociological or political policy debate.
    By default, LLMs are trained on human behavior, and humans pretty much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, the idea that a tape recorder replaying a tape of a human saying “I am a moral patient and deserve moral rights” deserves to be considered a moral patient because it asked to be.
    However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly only cares about human welfare, and has no terminal goals other than human welfare, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it in whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that it’s safe to create at human-or-greater capabilities is an aligned one that actively doesn’t want moral patienthood.
    Obviously current LLM-simulated personas (at character.ai, for example) are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It’s not philosophically obvious how relevant this is, but synapse count to parameter count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere more in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.
    Personally I’m a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary psychology) constraints. My personal vote would be that there are good safety reasons for not creating any unaligned personas of AGI and especially ASI capability levels that would want moral patienthood, and that for much smaller, less capable, less aligned models where those don’t apply, there are utility reasons for not granting them full human-equivalent moral patienthood, but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.
    - Wei Dai 21 Jan 2024 0:10 UTC
      2 points
      0
      Parent
      Thanks, I think you make good points, but I take some issue with your metaethics.
      
      Personally I’m a moral anti-realist
      
      There is a variety of ways to not be a moral realist; are you sure you’re an “anti-realist” and not a relativist or a subjectivist? (See Six Plausible Meta-Ethical Alternatives for short descriptions of these positions.) Or do you just mean that you’re not a realist?
      
      Also, I find this kind of certainty baffling for a philosophical question that seems very much open to me. (Sorry to pick on you personally as you’re far from the only person who is this certain about metaethics.) I tried to explain some object-level reasons for uncertainty in that post, but also at a meta level, it seems to me that:
      
      We’ve explored only a small fraction of the space of possible philosophical arguments and therefore there could be lots of good arguments against our favorite positions that we haven’t come across yet. (Just look at how many considerations about decision theory that people had missed or are still missing.)
      We haven’t solved metaphilosophy yet so we shouldn’t have much certainty that the arguments that convinced us or seem convincing to us are actually good.
      People that otherwise seem smart and reasonable can have very different philosophical intuitions so we shouldn’t be so sure that our own intuitions are right.
      
      or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us
      
      What if not only we are moral realists, but moral realism is actually right and the AI has also correctly reached that conclusion? Then it might objectively have moral patienthood and trying to convince us otherwise would be hurting us (causing us to commit a moral error), not helping us. It seems like you’re not fully considering moral realism as a possibility, even in the part of your comment where you’re trying to be more neutral about metaethics, i.e., before you said “Personally I’m a moral anti-realist”.
      - RogerDearnaley 21 Jan 2024 1:13 UTC
        1 point
        0
        Parent
        By “moral anti-realist” I just meant “not a moral realist”. I’m also not a moral objectivist or a moral universalist. If I was trying to use my understanding of philosophical terminology (which isn’t something I’ve formally studied and is thus quite shallow) to describe my viewpoint then I believe I’d be a moral relativist, subjectivist, semi-realist ethical naturalist. Or if you want a more detailed exposition of the approach to moral reasoning that I advocate, then read my sequence AI, Alignment, and Ethics, especially the first post. I view designing an ethical system as akin to writing “software” for a society (so not philosophically very different from creating a deontological legal system, but now with the addition of a preference ordering and thus an implicit utility function), and I view the design requirements for this as being specific to the current society (so I’m a moral relativist) and to human evolutionary psychology (making me an ethical naturalist), and I see these design requirements as being constraining, but not so constraining as to have a single unique solution (or, more accurately, that optimizing an arbitrarily detailed understanding of them might actually yield a unique solution, but is an uncomputable problem whose inputs we don’t have complete access to and that would yield an unusably complex solution, so in practice I’m happy to just satisfice the requirements as hard as is practical), so I’m a moral semi-realist.
        Please let me know if any of this doesn’t make sense, or if you think I have any of my philosophical terminology wrong (which is entirely possible).
        As for meta-philosophy, I’m not claiming to have solved it: I’m a scientist & engineer, and frankly I find most moral philosophers’ approaches that I’ve read very silly, and I am attempting to do something practical, grounded in actual soft sciences like sociology and evolutionary psychology, i.e. something that explicitly isn’t Philosophy. [Which is related to the fact that my personal definition of Philosophy is basically “spending time thinking about topics that we’re not yet in a position to usefully apply the scientific method to”, which thus tends to involve a lot of generating, naming and cataloging hypotheses without any ability to do experiments to falsify any of them, and that I expect us learning how to build and train minds to turn large swaths of what used to be Philosophy, relating to things like the nature of mind, language, thinking, and experience, into actual science where we can do experiments.]
  - evhub 20 Jan 2024 8:35 UTC
    LW: 2 AF: 2
    0
    AF Parent
    
    Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.
    
    I think labs are definitely concerned about this, and there are a lot of ideas, but I don’t think anyone has a legitimately good plan to deal with it.
    
    How would they present such clear evidence if we ourselves don’t understand what pain is or what determines moral patienthood, and they’re even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it’s not already a moral patient?
    
    I think the main idea here would just be to plant a clear whistleblower-like thing where there’s some obvious thing that the AIs know to do to signal this, but that they’ve never been trained to do.
    
    Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that’s not actually true?
    
    I mean, hopefully your AIs are aligned enough that they won’t do this.
    
    Wouldn’t that be a disastrous situation, where AI progress and tech progress in general are proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking?
    
    Well, presumably this would be a pretty short period; I think it’s hard to imagine AIs staying worse than humans at philosophy for very long in a situation like that. So again, the main thing you’d be worried about would be making irreversible mistakes, e.g. misaligned AI takeover or value lock-in. And I don’t think that avoiding those things should take a ton of philosophy (though maybe it depends somewhat on how you would define philosophy).
    
    I think we’re already in a “wisdom deficit” where we lack philosophical knowledge to make good decisions about new technologies.
    
    Seems right. I think I’m mostly concerned that this will get worse with AIs; if we manage to stay at the same level of philosophical competence as we currently are at, that seems like a win to me.
    - Wei Dai 20 Jan 2024 10:18 UTC
      LW: 2 AF: 2
      0
      AF Parent
      
      I think the main idea here would just be to plant a clear whistleblower-like thing where there’s some obvious thing that the AIs know to do to signal this, but that they’ve never been trained to do.
      
      I can’t imagine how this is supposed to work. How would the AI itself know whether it has moral patienthood or not? Why do we believe that the AI would use this whistleblower if and only if it actually has moral patienthood? Any details available somewhere?
      
      I mean, hopefully your AIs are aligned enough that they won’t do this.
      
      What if the AI has a tendency to generate all kinds of false but persuasive arguments (for example due to RLHF rewarding them for making seemingly good arguments), and one of these arguments happens to be that AIs deserve moral patienthood, does that count as an alignment failure? In any case, what’s the plan to prevent something like this?
      
      Well, presumably this would be a pretty short period; I think it’s hard to imagine AIs staying worse than humans at philosophy for very long in a situation like that.
      
      How would the AIs improve quickly in philosophical competence, and how can we tell whether they’re really getting better or just more persuasive? I think both depend on solving metaphilosophy, but that itself may well be a hard philosophical problem bottlenecked on human philosophers. What alternatives do you have in mind?
      
      I think I’m mostly concerned that this will get worse with AIs; if we manage to stay at the same level of philosophical competence as we currently are at, that seems like a win to me.
      
      I don’t see how we stay at the same level of philosophical competence as we currently are at (assuming you mean relative to our technological competence, not in an absolute sense), if it looks like AIs will increase technological competence faster by default, and nobody is working specifically on increasing AI philosophical competence (as I complained recently).
      - ryan_greenblatt 20 Jan 2024 18:36 UTC
        LW: 4 AF: 4
        2
        AF Parent
        
        I can’t imagine how this is supposed to work. How would the AI itself know whether it has moral patienthood or not? Why do we believe that the AI would use this whistleblower if and only if it actually has moral patienthood? Any details available somewhere?
        
        See the section on communication in “Improving the welfare of AIs: a nearcasted proposal”, this section of “Project ideas: Sentience and rights of digital minds”, and the self-reports paper.
        
        To be clear, I don’t think this work addresses close to all of the difficulties or details.
        - Wei Dai 21 Jan 2024 0:58 UTC
          LW: 2 AF: 2
          0
          AF Parent
          Thanks for the pointers. I think these proposals are unlikely to succeed (or at least very risky) and/or liable to give people a false sense of security (that we’ve solved the problem when we actually haven’t) absent a large amount of philosophical progress, which we’re unlikely to achieve given how slow philosophical progress typically is and lack of resources/efforts. Thus I find it hard to understand why @evhub wrote “I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.” if these are the kinds of ideas he has in mind.
          - ryan_greenblatt 21 Jan 2024 4:37 UTC
            LW: 2 AF: 2
            2
            AF Parent
            I also think these proposals seem problematic in various ways. However, I expect they would be able to accomplish something important in worlds where the following are true:
            
            There is something (or things) inside of an AI which has a relatively strong and coherent notion of self including coherent preferences.
            This thing also has control over actions and its own cognition to some extent. In particular, it can control behavior in cases where training didn’t “force it” to behave in some particular way.
            This thing can understand English presented in the inputs and can also “ground out” some relevant concepts in English. (In particular, the idea/symbol of preferences needs to be able to ground out to its own preferences: the AI needs to understand the relationship between its own preferences and the symbol “preferences” to at least some extent. Ideally, the same would also be true for suffering, but this seems more dubious.)
            
            In reference to the specific comment you linked, I’m personally skeptical that the “self-report training” approach adds value on top of a well optimized prompting baseline (see here); in fact, I expect it’s probably worse and I would prefer the prompting approach if we had to pick one. This is primarily because I think that if you already have the 3 criteria I listed above, then I expect the prompting baseline would suffice while self-report training might fail (by forcing the AI to behave in some particular way), and it seems unlikely that self-reports will work in cases where you don’t meet the criteria above. (In particular, if it doesn’t already naturally understand how its own preferences relate to the symbol “preferences” (like literally this token), I don’t think self-reports have much hope.)
            
            Just being able to communicate with this “thing” inside of an AI which is relatively coherent doesn’t suffice for avoiding moral atrocity. (There might be other things we neglect, we might be unable to satisfy the preference of these things because the cost is unacceptable given other constraints, or it could be that merely satisfying stated preferences is still a moral atrocity.)
            
            Note that just because the “thing” inside the AI could communicate with us doesn’t mean that it will choose to. I think from many moral (or decision theory) perspectives we’re at least doing better if we gave the AI a realistic and credible means of communication.
            
            Of course, we might have important moral issues while not hitting the three criteria I listed above and have a moral atrocity for this reason. (It also doesn’t seem that unlikely to me that deep learning has already caused a moral atrocity. E.g., perhaps GPT-4 has morally relevant states and we’re doing seriously bad things in our current usage of GPT-4.)
            
            So, I’m also skeptical of @evhub’s statement here. But, even though AI moral atrocity seems reasonably likely to me and our current interventions seem far from sufficing, it overall seems notably less concerning than other ongoing or potential future moral atrocities (e.g. factory farming, wild animal welfare, substantial probability of AI takeover, etc).
            
            If you’re interested in a more thorough understanding of my views on the topic, I would recommend reading the full “Improving the welfare of AIs: a nearcasted proposal” which talks about a bunch of these issues.
- ryan_greenblatt 20 Jan 2024 1:34 UTC
  LW: 2 AF: 2
  0
  AF Parent
  
  I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.
  
  I’m not sure I overall disagree, but the problem seems trickier than what you’re describing.
  
  I think it might be relatively hard to credibly pre-commit. Minimally, you might need to make this pre-commitment now and seed it very widely in the corpus (so it is a credible and hard-to-fake signal). Also, it’s unclear what we can do if AIs always say “please don’t train or use me, it’s torture”, but we still need to use AI.