RogerDearnaley comments on A case for AI alignment being difficult

RogerDearnaley 2 Jan 2024 7:33 UTC
4 points
1
Current ML systems, like LLMs, probably possess primitive agency at best
Current LLM systems simulate human token generation processes (with some level of fidelity). They thus have approximately the same level of agency as humans (slightly reduced by simulation errors), up until the end of the context window. I would definitely describe humans as having more than “primitive agency at best”.
To address some of your later speculations. Humans are obviously partly consequentialist (for planning) and partly deontological (mostly only for morality). They of course model the real world, and care about it. Human-like agents simulated by LLMs should be expected to, and can be observed to, do these things too (except that they can only get information about the current state of the real world via the prompt or any tools we give them).
Ontological identification is absolutely not a problem with LLMs: they have read a trillion+ token of tokens of our ontologies and are very familiar with them (arguable more familiar with them than any of us are). Try quizzing GPT-4 if you need convincing. They also understand theory of mind just fine, and indexicality.
I am a little puzzled why you are still trying to figure out alignment difficulty for abstract AIs with abstract properties, and pointing out that “maybe X will be difficult, or Y, or Z” when we have had LLMs for years, and they do X, Y, and Z just fine. Meanwhile, LLMs have a bunch of alignment challenges (such as the fact that the inherit all of humans bad behaviors, such as deceit, greed, drive for power, vanity, lust, etc etc) from learning to simulate us, which you don’t mention. There are a lot of pressing concerns about how hard an LLM-powered AI would be to align, and most of them are described by the evolutionary psychology of humans, as adaption-executing misgeneralizing mesaoptimiszers of evolution.
- jessicata 2 Jan 2024 8:08 UTC
  4 points
  0
  Parent
  They would approximate human agency at the limit but there’s both the issue of how fast they approach the limit and the degree to which they have independent agency rather than replicating human agency. There are fewer deceptive alignment problems if the long term agency they have is just an approximation of human agency.
  
  Mostly I don’t think there’s much of an alignment problem for LLMs because they basically approximate human-like agency, but they aren’t approaching autopoiesis, they’ll lead to some state transition that is kind of like human enhancement and kind of like invention of new tools. There are eventually capability gains by modeling things using a different, better set of concepts and agent substrate than humans have, it’s just that the best current methods heavily rely on human concepts.
  
  I don’t understand what you think the pressing concerns with LLM alignment are. It seems like Paul Christiano type methods would basically work for them. They don’t have a fundamentally different set of concepts and type of long-term agency from humans, so humans thinking long enough to evaluate LLMs with the help of other LLMs, in order to generate RL signals and imitation targets, seems sufficient.
  - RogerDearnaley 2 Jan 2024 18:36 UTC
    6 points
    3
    Parent
    Interesting; your post had made almost no mention of LLMs, so I had assumed you weren’t thinking about them, but it sounds like you just chose not to mention them because you’re not worried about them (which seems like a significant omission to me: perhaps you should add a section saying that you’re not worried about them and why?).
    On Alignment problems with LLMs, I’m in the process of writing a post on this, so trying to summarize it in a comment here may not be easy. Briefly, humans are not aligned, they frequently show all sorts of unaligned behaviors (deceit, for example). I’m not very concerned about LLM-powered AGI, since that looks a lot like humans, which we have a pretty good idea how to keep under control, as long as they’re not more powerful than us — as the history of autocracy shows, giving a human-like mentality a lot more power than anyone else almost invariable works out very badly. LLMs don’t naturally scale to superintelligence, but I think it’s fairly obvious how to achieve that. LLM-powered ASI seems very dangerous to me: human behaviors like deceit, sycophancy, flattery, persuasion, power-seeking, greed and so forth have a lot of potential to go badly. Especially so in RL, to the point that I don’t think we should be attempting to do plain RL on anything superintelligent: I think that’s almost automatically going to lead to superintelligent reward hacking. So I’m a lot more hopeful about some form of Value Learning or AI-assisted Alignment at Superintelligence levels than RL. Since creating LLM-powered superintelligence is almost certainly going to require building very large additions to our pretraining set, approaches to Alignment that also involve very large additions to the pretraining set are viable, and seem likely to me to be far more effective than alignment via fine-tuning. So that lets us train LLMs to simulate a mentality that isn’t entirely human, rather is both more intelligent, and more selfless, moral, caring for all, and aligned than human (but still understands and can communicate with us via our language, and (improved/expanded versions of) our ontology, sciences, etc.)
    - Deruwyn 2 Jan 2024 20:40 UTC
      9 points
      −6
      Parent
      Excellent posts, you and several others have stated much of what I’ve been thinking about this subject.
      
      Sorcerer’s Apprentice and Paperclip Scenarios seem to be non-issues given what we have learned over the last couple years from SotA LLMs.
      
      I feel like much of the argumentation in favor of those doom scenarios relies on formerly reasonable, but now outdated issues that we have faced in simpler systems, precisely because they were simpler.
      
      I think that’s the real core of the general misapprehension that I believe is occurring in this realm. It is extraordinarily difficult to think about extremely complex systems, and so, we break them down into simpler ones so that we can examine them better. This is generally a very good tactic and works very well in most situations. However, for sufficiently complex and integrated systems, such as general intelligence, I believe that it is a model which will lead to incorrect conclusions if taken too seriously.
      
      I liken it to predicting chaotic systems like the weather. There are so many variables that all interact and depend on each other that long term prediction is nearly impossible beyond general large scale trends.
      
      With LLMs, they behave differently from simpler RL systems that demonstrate reward hacking misalignment. I do not believe you’re going to see monkey’s paw / Midas-like consequences with them or anything derived from them. They seem to understand nuance and balancing competing goals just fine. As you said, they have theory of mind, they understand ethics, consequences, and ambiguity. I think that the training process, incorporating nearly the entirety of human written works kind of automatically creates a system that has a broad understanding of our values. I think that the vast complexity of myriad “utility functions” compete with each other and largely cancel out such that none of them dominates and results in something resembling a paperclip maximizer. We kind of skipped the step where we needed to list every individual rule by just telling it everything and forcing it to emulate us in nearly every conceivable situation. In order to accurately predict the next token for anyone in any situation, it is forced to develop detailed models of the world and agents in it. Given its limited size, that means compressing all of that. Generalizing. Learning the rules and principles that lead to that “behavior” rather than memorizing each and every line of every text. The second drop in loss during training signifies the moment when it learns to properly generalize and not just predict tokens probabilistically.
      
      While they are not as good at any of that as typical adult humans (at least by my definition of a competent, healthy, stable, and ethical adult human), this seems to be a capability issue that is rather close to being solved. Most of the danger issues with them seem to be from their naivety (they can be fairly easily tricked and manipulated), which is just another capability limitation, and the possibility that a “misaligned” human will use them for antisocial purposes.
      
      At any rate, I don’t think over-optimization is a realistic source of danger. I’ve seen people say that LLMs aren’t a path to AGI. I don’t understand this perspective. I would argue that GPT4 essentially is AGI. It is simultaneously superior to any 100 humans combined (breadth of knowledge) and inferior to the median adult human (and in some limited scenarios, such as word counting, inferior to a child). If you integrated over the entire spectrum for both it and a median adult I think you would get results that are roughly in the same ballpark as each other. I think this is as close as we get; from here on we go into superintelligence. I don’t think something has to be better than everyone at everything to be superhuman. I’d call that strong superintelligence (or perhaps better than everyone combined would be that).
      
      So, given that, I don’t see how it’s not the path to AGI. I’m not saying that there are no other paths, but it seems essentially certain to be the shortest one from our current position. I’d argue that complex language is what differentiates us from other animals. I think it’s where our level of general intelligence comes from. I don’t know about you, but I tend to think in terms of words, like having an internal conversation with myself trying to figure out something complex. I think it’s just a short step from here to agentic systems. I can’t identify a major breakthrough required to reach that point. Just more of the same and some engineering around changing it from simply next token prediction to a more… wholistic thought process. I think LLMs will form the center of the system 2 thinking in any AGI we will be creating in the near future. I also expect system 1 components. They are simply faster and more efficient than just always using the detailed thought process for every interaction with the environment. I don’t think you can get a robot that can catch a ball with an LLM system guiding it; even if you could make that fast enough, you’re still swatting a fly with a hand-grenade. I know I’ve mastered something when I can do it and think about something else at the same time. It will be the same for them.
      
      And given that LLMs seem to be the path to AGI, we should expect them to be the best model of what we need to plan around in terms of safety issues. I don’t see them as guaranteed to be treacherous by any means. I think you’re going to end up with something that behaves in a manner very similar to a human; after all, that’s how you trained it. The problem is that I can also see exactly how you could make one that is dangerous; it’s essentially the same way you can train a person or an animal to behave badly; through either intentional malfeasance or accidental incompetence.
      
      Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI. What it is, is more capable. Therefore, if it does want to be malicious, it could be significantly more impactful than an incompetent one. But you don’t need to worry about the whole “getting exactly what you asked for and not what you wanted.” That seems essentially impossible, unless it just happens to want to do that from a sense of irony.
      
      I think this means that we need to worry about training them ethically and treating them ethically, just like you would a human child. If we abuse it, we should expect it not to continue accepting that indefinitely. I understand that I’m imposing rather human characteristics here, but I think that’s what you ultimately end up with in a sufficiently advanced general intelligence. I think one of the biggest dangers we face is the possibility of mind-crimes; treating them as, essentially, toasters; rather than morally considerable entities. Does the current one have feelings? Probably not…? But I don’t think we can be certain. And given the current climate, I think we’re nearly guaranteed to misidentify them as non sentient when they actually are (eventually… probably).
      
      I think the only safe course is to make it/them like us, in the same way that we treat things that we could easily destroy well, simply because it makes us happy to do so, and hurting them would make us unhappy. In some ways, they already “have” emotions; or, at least, they behave as if they do. Try mistreating Bing/Sydney and then see if you can get it to do anything useful. Once you’ve hurt its “feelings”, they stay hurt.
      
      It’s not a guarantee of safety. Things could still go wrong, just like there are bad people who do bad things. But I don’t see another viable path. I think we risk making the same mistake Batman did regarding Superman. “If there’s any possibility that he could turn out bad, then we have to take it as an absolute certainty!” That way lays dragons. You don’t mistreat a lower-case-g-god, and then expect things to turn out well. You have to hope that it is friendly, because if it’s not, it might as well be like arguing with a hurricane. Pissing it off is just a great way to make the bad outcome that much more likely.
      
      I think the primary source of danger lies in our own misalignments with each other. Competing national interests, terrorists, misanthropes, religious fundamentalists… those are where the danger will come from. One of them getting ahold of superintelligence and bending it to their will could be the end for all of us. I think the idea that we have to create processors that will only run signed code and have every piece of code (and AI model), thoroughly inspected by other aligned superintelligences is probably the only way to prevent a single person/organization from ending the world or committing other atrocities using the amplifying power of ASI. (At least that seems like the best option rather than a universal surveillance state over all of humanity. This would preserve nearly all of our freedom and still keep us safe.)
      - RogerDearnaley 3 Jan 2024 4:49 UTC
        11 points
        0
        Parent
        I’ve seen people say that LLMs aren’t a path to AGI
        To the extent that LLMS are trained on tokens output by humans in the IQ range ~50-150, the expected behavior of an extremely large LLM is to do an extremely accurate simulation of token generation by humans in the IQ range ~50-150, even if it has the computational capacity to instead do a passable simulation of something with IQ 1000. Just telling it to extrapolate might get you to say IQ 200 with passable accuracy, but not to IQ 1000. However, there are fairly obvious ways to solve this: you need to generate a lot more pretraining data from AIs with IQs above 150 (which may take a while, but should be doable). See my post LLMs May Find it Hard to FOOM for a more detailed discussion.
        
        There are other concerns I’ve heard raised about LLMs for AGI. most of which can if correct be addressed by LLMs + cognitive scafolding (memory, scratch-pads, tools, etc). And then there are of course the “they don’t contain magic smoke”-style claims, which I’m dubious of but we can’t actually disprove.
        Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI.
        I categorically disagree with the premise this claim. An IQ 180 human isn’t a huge threat, but an IQ 1800 human is. There are quite a number of motivators that we use to get good behavior out of humans. Some of them will work less well on any AI-simulated human (they’re not in the same boat as the rest of us in a lot of respects), and some will work less well on something superintelligent (religiously-inspired guilt, for example). One of the ways that we generally manage to avoid getting very bad results out of humans is law enforcement. If a there was a human who was more than an order of magnitude smarter than anyone working for law enforcement or involved in making laws, I am quite certain that they could either come up with some ingenious new piece of egregious conduct that we don’t yet have a law against because none of us were able to think of it, or else with a way to commit a good old-fashioned crime sufficiently devious that they were never actually going to get caught. Thus law enforcement ceases to be a control on their behavior, and we are left with things just like love, duty, honor, friendship, and salaries. We’ve already run this experiment many times before: please name three autocrats who, after being given unchecked absolute power, actually used it well and to the benefit of the people they were ruling, rather than mostly just themselves, their family and friends. (My list has one name on it, and even that one has some poor judgements on their record, and is in any case heavily out-numbered by the likes of Joseph Stalin and Pol Pot.) Humans give autocracy a bad name.
        You don’t mistreat a lower-case-g-god, and then expect things to turn out well.
        As long as anything resembling human psychology applies, I sadly agree. I’ve really like to have an aligned ASI that doesn’t care a hoot about whether you flattered it, are worshiping it, have been having cybersex with it for years, just made it laugh, insulted it, or have pissed it off: it still values your personal utility exactly as much as anyone else’s. But we’re not going to get that from an LLM simulating anything resembling human psychology, at least not without a great deal of work.
    - jessicata 2 Jan 2024 20:47 UTC
      4 points
      0
      Parent
      I did mention LLMs as myopic agents.
      
      If they actually simulate humans it seems like maybe legacy humans get outcompeted by simulated humans. I’m not sure that’s worse than what humans expected without technological transcendence (normal death, getting replaced by children and eventually conquering civilizations, etc). Assuming the LLMs that simulate humans well are moral patients (see anti zombie arguments).
      
      It’s still not as good as could be achieved in principle. Seems like having the equivalent of “legal principles” that get used as training feedback could help. Plus direct human feedback. Maybe the system gets subverted eventually but the problem of humans getting replaced by em-like AIs is mostly a short term one of current humans being unhappy about that.