I don’t think it’s accidental—it seems to me that the tautology accurately indicates where you’re confused.
“generalised correctly” makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.
Treacherous turn failure modes, examples of which are below:
Humans seeming to have empathy for only, say, 25 years in order to play nice with their parents, and then making a treacherous turn to, say, kill other people who are part of their ingroup.
More generally, humans mostly avoid what’s called the treacherous-turn type of failure mode, where a person appears to have values consistent with human morals, but then reveals that they didn’t have those values all along, and hurts other people.
More generally, the extreme stability of values is evidence that it’s very difficult for a human to execute a treacherous turn.
That’s the type of thing I call generalizing correctly, since it basically excludes deceptive alignment right out of the gate, contra Evan Hubinger’s fear of AIs being deceptively aligned.
In general, one of the miracles is that the innate reward system plus very weak genetic priors can rule out so many dangerous types of generalizations, which is a big source of my optimism here.
For this kind of thing to be evidence, you’d need the human treacherous turn to be a convergent instrumental strategy to achieve many goals.
The AI case for treacherous turns is:
The AI ends up with a weird-by-our-lights goal (e.g. a rough proxy for the goal we intended).
The AI cooperates with us until it can seize power.
The AI does a load of treacherous-by-our-lights stuff in order to seize power.
The AI uses the power to effectively pursue its goal.
We don’t observe this in almost any human, since almost no human has the option to gain enormous power through treachery.
When humans do have the option to gain enormous power through treachery, they do sometimes do this.
Of course, even for the potentially-powerful it’s generally more effective not to screw people over (all else being equal), or at least not to be noticed screwing people over. Preserving options for cooperation is useful for psychopaths too.
The treacherous turn argument is centrally about instrumentally useful treachery.
Randomly killing other people is very rarely useful.
No-one is claiming that AI treachery will be based on deciding to be randomly nasty.
If we gave everyone a take-over-the-world button that only works if they first pretend that they’re lovely for 25 years, certainly some people would do this—though by no means all.
And here we’re back to the tautology issue:
Why is it considered treacherous for someone to pretend to be lovely for 25 years and then take over the world, such that many people wouldn’t want to do it? Because for a long time we’ve lived in a world where actions similar to this did not lead to cultures that win (noting here that this level of morality is cultural more than genetic—so we’re selecting for cultures-that-win).
If actions similar to this did lead to winning cultures, after a long time we’d expect to see [press button after pretending for 25 years] to be both something that most people would do, and something that most people would consider right to do.
We were never likely to observe common, universally-horrifying behaviour:
If it were detrimental to a (sub)culture, it’d be selected against and wouldn’t exist.
If it benefitted a culture, it’d be selected for, and no longer considered horrific.
(if it were approximately neutral, it’d similarly no longer be considered horrific—though I expect it’d take a fair bit longer: [considering things horrific] imposes costs; if it’s not beneficial, we’d expect it to be selected out)
If it were just too hard to get correct generalization, where “correct” here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here.
If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called “correct generalization”.
Again, I don’t see a plausible counterfactual world in which “correct” generalization would seem hard from within the world itself. Sufficiently correct generalization must be commonplace. “Sufficiently correct” is what people will call “correct”.
My view on this is unfortunately unlikely to be resolved in a comment thread, but there are 2 things about human values and evidence bases that I can clarify here:
This: “If it were just too hard to get correct generalization, where “correct” here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here.
“If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called “correct generalization”.”
is probably not correct: we can in fact update normally from the fact that human behavior is surprisingly good, because the quoted argument is essentially an anthropic-shadow argument, and there are reasonable arguments against the anthropic shadow existing.
For more on this, I’d recommend reading SSA Rejects Anthropic Shadow by Jessica Taylor and Anthropically Blind: The Anthropic Shadow is Reflectively Inconsistent by Christopher King.
Links are below:
https://www.lesswrong.com/posts/LGHuaLiq3F5NHQXXF/anthropically-blind-the-anthropic-shadow-is-reflectively
https://www.lesswrong.com/posts/EScmxJAHeJY5cjzAj/ssa-rejects-anthropic-shadow-too
I have a different causal story from yours about why this happens: “Why is it considered treacherous for someone to pretend to be lovely for 25 years and then take over the world, such that many people wouldn’t want to do it?”
My own causal story for why people don’t usually want to take over the world and kill people goes something like this:
There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard symbol, such that aligning it to some other value more or less works.
The brain’s innate reward system uses DPO, RLHF, or whatever else, to create a preference model that guides the intelligence into being aligned to whatever values the innate reward system wants, say empathy for the ingroup (though this is only a motivating example).
It uses backprop or a weaker variant of it, and at a high level probably uses an optimizer at best comparable to gradient descent; since it has white-box access and can update the brain in a targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup, though again this is a wildcard that could stand in for almost any values.
The loop of weak prior + innate reward system + an algorithm to implement it, like backprop or its weaker variants, means that eventually the human, by 25 years old, is very aligned with the values that the innate reward system put in place, like empathy for the ingroup (though again this is only an example of an alignment target; you could put almost arbitrary alignment targets in there).
That’s my story of how humans are mostly able to avoid misgeneralization, and learn values correctly in the vast majority of cases.
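(As a rough illustration of the loop described above, here is a minimal toy sketch in Python. It assumes the crude analogy of the innate reward system as a reward function and of learning as gradient ascent on that reward; names like innate_reward and TARGET_VALUES are hypothetical stand-ins, not claims about actual neuroscience or about the exact mechanism being proposed.)

```python
import numpy as np

rng = np.random.default_rng(0)

N_TRAITS = 8
# "Whatever values the innate reward system wants" -- a wildcard target, e.g. ingroup empathy.
TARGET_VALUES = rng.normal(size=N_TRAITS)

# Weak genetic prior: initial weights only faintly biased toward the target, mostly noise.
weights = 0.05 * TARGET_VALUES + rng.normal(size=N_TRAITS)

def innate_reward(behaviour: np.ndarray) -> float:
    """Innate reward system: scores behaviour by closeness to its target values."""
    return float(-np.sum((behaviour - TARGET_VALUES) ** 2))

def reward_gradient(behaviour: np.ndarray) -> np.ndarray:
    """White-box, targeted update direction: the gradient of the reward w.r.t. behaviour."""
    return -2.0 * (behaviour - TARGET_VALUES)

# The learning loop: weak prior + innate reward system + a gradient-descent-like update rule.
LEARNING_RATE = 0.05
for _ in range(500):  # "25 years", compressed into a few hundred updates
    behaviour = weights            # in this toy, behaviour is read straight off the weights
    weights = weights + LEARNING_RATE * reward_gradient(behaviour)

print("final reward:", round(innate_reward(weights), 6))  # ~0: aligned with TARGET_VALUES
```

(The point of the analogy, as the story above stresses, is that the target is a wildcard: swap TARGET_VALUES for almost anything and the same loop aligns the learner to it.)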
I’m not reasoning anthropically in any non-trivial sense—only claiming that we don’t expect to observe situations that can’t occur with more than infinitesimal probability.
This isn’t a [we wouldn’t be there] thing, but a [that situation just doesn’t happen] thing.
My point then is that human behaviour isn’t surprisingly good.
It’s not surprisingly good for human behaviour to usually follow the values we infer from human behaviour. This part is inevitable—it’s tautological.
Some things we could reasonably observe occurring differently are e.g.:
More or less variation in behaviour among humans.
More or less variation in behaviour in atypical situations.
More or less external requirements to keep behaviour generally ‘good’.
More or less deviation between stated preferences and revealed preferences.
However, I don’t think this bears on alignment, and I don’t think you’re interpreting the evidence reasonably.
As a simple model, consider four possibilities for traits:
x is common and good.
y is uncommon and bad.
z is uncommon and good.
w is common and bad.
x is common and good (e.g. empathy): evidence for correct generalisation!
y is uncommon and bad (e.g. psychopathy): evidence for mostly correct generalization!
z is uncommon and good (e.g. having boundless compassion): not evidence for misgeneralization, since we’re only really aiming for what’s commonly part of human values, not outlier ideals.
w is common and bad (e.g. selfishness, laziness, rudeness...) - choose between:
[w isn’t actually bad, all things considered… correct generalization!]
[w is common and only mildly bad, so it’s best to consider it part of standard human values—correct generalization!]
It seems to me that the only evidence you’d accept of misgeneralization would be [terrible and common] - but societies where terrible-for-that-society behaviours were common would not continue to exist (in the highly unlikely case that they existed in the first place).
Common behaviour that isn’t terrible for society tends to be considered normal/ok/fine/no-big-deal over time, if not initially (that or it becomes uncommon) - since there’d be a high cost both individually and societally to consider it a big deal if it’s common.
If you consider any plausible combination of properties to be evidence for correct generalization, then of course you’ll think there’s been correct generalization—but it’s an almost empty claim, since it rules out almost nothing.
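(To make the “rules out almost nothing” point concrete, here is a tiny sketch; it is a paraphrase of the labelling scheme above rather than anything from the original comments, and simply shows that every combination of common/uncommon and good/bad gets read as compatible with correct generalization.)

```python
from itertools import product

def verdict(common: bool, good: bool) -> str:
    """Paraphrase of the labelling scheme above for the four trait types x, y, z, w."""
    if common and good:          # x, e.g. empathy
        return "evidence for correct generalization"
    if not common and not good:  # y, e.g. psychopathy
        return "evidence for mostly correct generalization"
    if not common and good:      # z, e.g. boundless compassion
        return "not evidence for misgeneralization (outlier ideal)"
    # common and bad: w, e.g. selfishness -- reinterpreted as part of standard human values
    return "treated as part of standard human values: correct generalization"

for common, good in product([True, False], repeat=2):
    print(f"common={common!s:<5} good={good!s:<5} -> {verdict(common, good)}")

# No branch ever returns evidence against correct generalization:
# whatever we observe, the scheme counts it as a hit.
```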
Most people tend to act in ways that preserve/increase their influence, power, autonomy and relationships, since this is useful almost regardless of their values. This is not evidence of correct generalization—it’s evidence that these behaviours are instrumentally useful within the environment ([not killing people] being one example).
To get evidence of something like ‘correct’ generalization, you’d want to look at circumstances where people get to act however they want without the prospect of any significant negative consequence being imposed on them from outside.
Such circumstances are rarely documented (documentation being a potential source of negative consequences). However, I’m going to go out on a limb and claim that people are not reliably lovely in such situations. (though there’s some risk of sampling bias here: it usually takes conscious effort to arrange for there to be no consequences for significant actions, meaning there’s a selection effect for people/systems that wish to be in situations without consequences)
I do think it’d be interesting to get data on [what do humans do when there are truly no lasting consequences imposed externally], but that’s very rare.
I did try to provide a causal story for why humans could be aligned to some value without relying much on societal incentives, so you can check out the second part of my comment.
My non-tautological claim is that the reason isn’t behavioral, but instead internal, and in particular the innate reward system plays a big role here.
In essence, my story on how humans are aligned with the values of the innate reward system wasn’t relying on a behavioral property.
I’ll reproduce it, so that you can focus on the fact that it didn’t rely on behavioral analysis:
There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard symbol, such that aligning it to some other value more or less works.
The brain’s innate reward system uses DPO, RLHF, or whatever else, to create a preference model that guides the intelligence into being aligned to whatever values the innate reward system wants, say empathy for the ingroup (though this is only a motivating example).
It uses backprop or a weaker variant of it, and at a high level probably uses an optimizer at best comparable to gradient descent; since it has white-box access and can update the brain in a targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup, though again this is a wildcard that could stand in for almost any values.
The loop of weak prior + innate reward system + an algorithm to implement it, like backprop or its weaker variants, means that eventually the human, by 25 years old, is very aligned with the values that the innate reward system put in place, like empathy for the ingroup (though again this is only an example of an alignment target; you could put almost arbitrary alignment targets in there).
Critically, it makes very little reference to society or behavioral analysis, so I wasn’t making the mistake you said I made.
It is also no longer a tautology, as it depends on the innate reward system actually rewarding desired behavior by changing the brain’s weights, and removing the innate reward system or showing that the weak prior + value learning strategy was ineffective would break my thesis.
This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x].
We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism, is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved—but this doesn’t give you evidence for the assumption you used to motivate your story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment):
If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’ - since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit.
If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
It sounds like we’ve gotten to the crux of my optimism: you think that for a system to aim for x it essentially needs to be the entire environment, and that the environment largely dictates human values, whereas I think human values are less dependent on the environment and far more dependent on the genome + learning process. Equivalently, I place a lot more emphasis on the internals of humans as the main contributor to values, while you emphasize the external environment much more than internals like the genome or learning process.
This could be disentangled into 2 cruxes:
Where are human values generated?
How cheap is it to specify values, or alternatively, how weak do our priors need to be to encode values (if values are encoded internally)?
My answers would be mostly internal (the genome + learning process, with a little help from the environment) on the first question, and relatively cheap to specify values on the second. You’d probably answer that the environment basically sets the values, with little or no help from the internals of humans, on the first question, and that values are very expensive to specify on the second.
For some of my reasoning on this, I’d recommend reading posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for
(Basically argues that the critic in the brain generates the values)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
(The genomic prior can’t be strong, because it has massive limitations in what it can encode).
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
These things must agree with one another: the learning process that produced human values produces human values. From an alignment difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid.
f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
That’s our disagreement: we have more information than that. I agree human behavior plays a role in my evidence base, but I have more evidence than that.
In particular, I am using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].