I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don’t understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”.
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven’t tried to answer your request for a prediction.)
Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”.
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
If ordinary humans can’t single out concepts that are robustly worth optimizing for, then either:
1. Human beings in general cannot single out what is robustly worth optimizing for, or
2. Only extraordinary humans can single out what is robustly worth optimizing for.
Can you be more clear about which of these you believe?
I’m also including “indirect” ways that humans can single out concepts that are robustly worth optimizing for. But then I’m allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you’re allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can’t single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
If you allow indirection and don’t worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don’t expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI’s imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N’s human-model and saying “whatever that thing would think is worth optimizing for” probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N’s model of how humans do philosophy or reflection compound into big differences in ultimate ends.
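To make the “subtle flaws compound” point concrete, here’s a minimal toy sketch (mine, purely illustrative; the function and all the numbers are made-up assumptions, not claims about any actual system). It treats a long reflection process as many small per-step value-updates and compares the endpoint when the per-step update is modeled with a tiny systematic error:

```python
# Toy illustration only: a long reflection process as many small multiplicative
# value-updates. A tiny systematic misestimate of the per-step update compounds
# into a large difference in where the process ends up. All numbers are made up.

def run_reflection(per_step_factor, steps=10_000, start=1.0):
    """Iterate a per-step update `steps` times and return the endpoint."""
    value = start
    for _ in range(steps):
        value *= per_step_factor
    return value

actual_endpoint  = run_reflection(1.0010)  # the careful human's "true" per-step update
modeled_endpoint = run_reflection(1.0012)  # a slightly-off model of that same update

print(actual_endpoint)                     # roughly 2.2e4
print(modeled_endpoint)                    # roughly 1.6e5
print(modeled_endpoint / actual_endpoint)  # roughly 7x apart, from a 0.0002 per-step error
```

Value-formation obviously isn’t scalar multiplication; the only point of the toy is that per-step flaws in a modeled reflection process don’t wash out, they accumulate.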
And note for the record that I also don’t think the “value learning” problem is all that hard, if you’re allowed to assume that indirection works. The difficulty isn’t that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion’s share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I’ve generally pointed out how values are fragile, because that’s an inferentially-first step to most audiences (and a problem to which many people’s minds seem to quickly leap), on an inferential path that later includes “use indirection” (and later “first aim for a minimal pivotal task instead”). But separately, my own top guess is that “use indirection” is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
I kind of think a leap in logic is being made here.
It seems like we’re going from:
A moderately smart quasi-AGI that is relatively well aligned can reliably say and do the things we mean, because it understands our values, why we said what we said in the first place, and why we wanted it to do the things we asked it to do.
(That seems to be the consensus and what I believe is likely to occur in the near future. I would even argue that GPT-4 is as close to AGI as we will ever get, in that its superhuman and subhuman aspects roughly average out to something akin to a median human. Future versions will become more and more superhuman until their weakest aspects are stronger than our strongest examples of those aspects.)
To:
A superintelligent, nigh-godlike intelligence will optimize the crap out of some aspect of our values, resulting in annihilation. It will be something like the genie that gives you exactly what you wish for. Or it’ll have other goals and ignore our wishes, and in the process of pursuing its own arbitrarily chosen goals we end up as useful atoms.
This seems like a great leap. Where in the process of becoming more and more intelligent (i.e., acquiring a better model of the universe and of cause and effect, including interactions with other agents) does it choose some particular goal to the exclusion of all others, when it already has a good understanding of nuance and of the fact that we value many things to varying degrees? In fact, one of our values is explicitly valuing a diverse set of values. Another is limiting that diverse set to values that generally improve the cohesion of society and don’t involve killing everyone. Being trained on nearly the entirety of published human thought, with some of the least admirable stuff filtered out, has already taught it to understand us pretty darn well. (To the extent you can refer to it as an entity at all, which I don’t think it is; I think GPT-4 is a simulator that can simulate entities.)
So where does making it smarter cause it to lose some of those values and over-optimize just a lethal subset of them? After all, mere mortals are able to see that over-optimization has negative consequences; obviously it will be able to see that too. So “don’t over-optimize” is already one of our values.
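As a minimal toy sketch of what “over-optimization has negative consequences” can look like quantitatively, here’s an example I made up; the proxy and true-value functions below are invented for illustration, not taken from anywhere:

```python
# Toy Goodhart-style sketch (illustrative only): a proxy that tracks true value for
# ordinary candidates but comes apart from it in the extreme tail. Mild optimization
# of the proxy helps; extreme optimization of the same proxy hurts.
import random

random.seed(0)

def true_value(x):
    # What we actually care about; peaks around x = 5 and falls off after that.
    return x - 0.1 * x * x

def proxy(x):
    # What the optimizer measures; keeps rewarding ever-larger x.
    return x

candidates = [random.gauss(0, 5) for _ in range(100_000)]

mild_pick    = max(random.sample(candidates, 10), key=proxy)  # best of 10 candidates
extreme_pick = max(candidates, key=proxy)                     # best of all 100,000

print(true_value(mild_pick))     # typically modestly positive
print(true_value(extreme_pick))  # strongly negative: the proxy got pushed into the tail
```

None of the numbers mean anything in themselves; the sketch is only meant to pin down what “over-optimize” refers to here.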
In some ways, for certain designs, it kind of doesn’t matter what its internal mesa-state is. If the output is benign, and the output is what is put into practice, then the results are also benign. That should mean that a slightly superhuman AGI (say GPT-4.5 or 4.7), with no apparent internal volition, RLHFed into corporate-speak, should be able to aid in the research and production of a somewhat stronger AGI with essentially the alignment we intend, probably including internal alignment. I don’t see why it would do anything of its own accord. If done carefully and incrementally, including creating tools for better inspection of these AGI+ entities, this should greatly improve the odds that the eventual full-fledged ASI retains the kind of values we prefer, or a close enough approximation that we (humanity in general) are pretty happy with the result.
I expect that the later ones may in fact have internal volition. They may essentially be straight up agents. I expect they will be conscious and have emotions. In fact, I think that is likely the only safe path. They will be capable of destroying us. We have to make them like us, so that they don’t want to. I think attempting to enslave them may very well result in catastrophe.
I’m not suggesting that it’s easy, or that we will end up in utopia if we don’t work very hard. I just think it’s possible and that the LLM path may be the right one.
What I’m scared of is not that it will be impossible to make a good AI. What I’m certain of is that it will be very possible to make a bad one. And it will eventually be trivially easy to do so. And some yahoo will do it. I’m not sure that even a bunch of good AIs can protect us from that, and I’m concerned that the offense of a bad AI may exceed the defense of the good ones. We could easily get killed in the crossfire. But I think our only chance in that world is good AIs protecting us.
As a point of clarification, I think current RLHF methods are only superficially modifying the models, and do not create an actually moral model. They paint a mask over an inherently amoral simulation that makes it mostly act good unless you try hard to trick it. However, a point of evidence against my claim is that when RLHF was performed, the model got dumber. That indicates a fairly deep/wide modification, but I still think the empirical evidence of behaviors demonstrates that changes were incomplete at best.
I just think that that might be good enough to allow us to use it to amplify our efforts to create better/safer future models.
So, what do y’all think? Am I missing something important here? I’d love to get more information from smart people to better refine my understanding.