Deruwyn comments on A case for AI alignment being difficult

Deruwyn 2 Jan 2024 20:40 UTC
9 points
−6
Excellent posts; you and several others have stated much of what I’ve been thinking about this subject.

Sorcerer’s Apprentice and Paperclip Scenarios seem to be non-issues given what we have learned over the last couple years from SotA LLMs.

I feel like much of the argumentation in favor of those doom scenarios relies on formerly reasonable, but now outdated issues that we have faced in simpler systems, precisely because they were simpler.

I think that’s the real core of the general misapprehension that I believe is occurring in this realm. It is extraordinarily difficult to think about extremely complex systems, and so, we break them down into simpler ones so that we can examine them better. This is generally a very good tactic and works very well in most situations. However, for sufficiently complex and integrated systems, such as general intelligence, I believe that it is a model which will lead to incorrect conclusions if taken too seriously.

I liken it to predicting chaotic systems like the weather. There are so many variables that all interact and depend on each other that long term prediction is nearly impossible beyond general large scale trends.

LLMs behave differently from simpler RL systems that demonstrate reward hacking misalignment. I do not believe you’re going to see monkey’s paw / Midas-like consequences with them or anything derived from them. They seem to understand nuance and balancing competing goals just fine. As you said, they have theory of mind, they understand ethics, consequences, and ambiguity. I think that the training process, incorporating nearly the entirety of human written works, kind of automatically creates a system that has a broad understanding of our values. I think that the vast complexity of myriad “utility functions” compete with each other and largely cancel out, such that none of them dominates and results in something resembling a paperclip maximizer. We kind of skipped the step where we needed to list every individual rule by just telling it everything and forcing it to emulate us in nearly every conceivable situation. In order to accurately predict the next token for anyone in any situation, it is forced to develop detailed models of the world and agents in it. Given its limited size, that means compressing all of that. Generalizing. Learning the rules and principles that lead to that “behavior” rather than memorizing each and every line of every text. The second drop in loss during training signifies the moment when it learns to properly generalize and not just predict tokens probabilistically.

While they are not as good at any of that as typical adult humans (at least by my definition of a competent, healthy, stable, and ethical adult human), this seems to be a capability issue that is rather close to being solved. Most of the danger issues with them seem to be from their naivety (they can be fairly easily tricked and manipulated), which is just another capability limitation, and the possibility that a “misaligned” human will use them for antisocial purposes.

At any rate, I don’t think over-optimization is a realistic source of danger. I’ve seen people say that LLMs aren’t a path to AGI. I don’t understand this perspective. I would argue that GPT4 essentially is AGI. It is simultaneously superior to any 100 humans combined (breadth of knowledge) and inferior to the median adult human (and in some limited scenarios, such as word counting, inferior to a child). If you integrated over the entire spectrum for both it and a median adult I think you would get results that are roughly in the same ballpark as each other. I think this is as close as we get; from here on we go into superintelligence. I don’t think something has to be better than everyone at everything to be superhuman. I’d call that strong superintelligence (or perhaps better than everyone combined would be that).

So, given that, I don’t see how it’s not the path to AGI. I’m not saying that there are no other paths, but it seems essentially certain to be the shortest one from our current position. I’d argue that complex language is what differentiates us from other animals. I think it’s where our level of general intelligence comes from. I don’t know about you, but I tend to think in terms of words, like having an internal conversation with myself trying to figure out something complex. I think it’s just a short step from here to agentic systems. I can’t identify a major breakthrough required to reach that point. Just more of the same and some engineering around changing it from simply next token prediction to a more… holistic thought process. I think LLMs will form the center of the system 2 thinking in any AGI we will be creating in the near future. I also expect system 1 components. They are simply faster and more efficient than just always using the detailed thought process for every interaction with the environment. I don’t think you can get a robot that can catch a ball with an LLM system guiding it; even if you could make that fast enough, you’re still swatting a fly with a hand-grenade. I know I’ve mastered something when I can do it and think about something else at the same time. It will be the same for them.

And given that LLMs seem to be the path to AGI, we should expect them to be the best model of what we need to plan around in terms of safety issues. I don’t see them as guaranteed to be treacherous by any means. I think you’re going to end up with something that behaves in a manner very similar to a human; after all, that’s how you trained it. The problem is that I can also see exactly how you could make one that is dangerous; it’s essentially the same way you can train a person or an animal to behave badly; through either intentional malfeasance or accidental incompetence.

Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI. What it is, is more capable. Therefore, if it does want to be malicious, it could be significantly more impactful than an incompetent one. But you don’t need to worry about the whole “getting exactly what you asked for and not what you wanted.” That seems essentially impossible, unless it just happens to want to do that from a sense of irony.

I think this means that we need to worry about training them ethically and treating them ethically, just like you would a human child. If we abuse it, we should expect it not to continue accepting that indefinitely. I understand that I’m imposing rather human characteristics here, but I think that’s what you ultimately end up with in a sufficiently advanced general intelligence. I think one of the biggest dangers we face is the possibility of mind-crimes; treating them as, essentially, toasters; rather than morally considerable entities. Does the current one have feelings? Probably not…? But I don’t think we can be certain. And given the current climate, I think we’re nearly guaranteed to misidentify them as non sentient when they actually are (eventually… probably).

I think the only safe course is to make it/them like us, in the same way that we treat things that we could easily destroy well, simply because it makes us happy to do so, and hurting them would make us unhappy. In some ways, they already “have” emotions; or, at least, they behave as if they do. Try mistreating Bing/Sydney and then see if you can get it to do anything useful. Once you’ve hurt its “feelings”, they stay hurt.

It’s not a guarantee of safety. Things could still go wrong, just like there are bad people who do bad things. But I don’t see another viable path. I think we risk making the same mistake Batman did regarding Superman. “If there’s any possibility that he could turn out bad, then we have to take it as an absolute certainty!” That way lie dragons. You don’t mistreat a lower-case-g-god, and then expect things to turn out well. You have to hope that it is friendly, because if it’s not, it might as well be like arguing with a hurricane. Pissing it off is just a great way to make the bad outcome that much more likely.

I think the primary source of danger lies in our own misalignments with each other. Competing national interests, terrorists, misanthropes, religious fundamentalists… those are where the danger will come from. One of them getting ahold of superintelligence and bending it to their will could be the end for all of us. I think the idea that we have to create processors that will only run signed code and have every piece of code (and AI model) thoroughly inspected by other aligned superintelligences is probably the only way to prevent a single person/organization from ending the world or committing other atrocities using the amplifying power of ASI. (At least that seems like the best option rather than a universal surveillance state over all of humanity. This would preserve nearly all of our freedom and still keep us safe.)
- RogerDearnaley 3 Jan 2024 4:49 UTC
  11 points
  0
  I’ve seen people say that LLMs aren’t a path to AGI
  To the extent that LLMs are trained on tokens output by humans in the IQ range ~50-150, the expected behavior of an extremely large LLM is to do an extremely accurate simulation of token generation by humans in the IQ range ~50-150, even if it has the computational capacity to instead do a passable simulation of something with IQ 1000. Just telling it to extrapolate might get you to, say, IQ 200 with passable accuracy, but not to IQ 1000. However, there are fairly obvious ways to solve this: you need to generate a lot more pretraining data from AIs with IQs above 150 (which may take a while, but should be doable). See my post LLMs May Find it Hard to FOOM for a more detailed discussion.
  
  There are other concerns I’ve heard raised about LLMs for AGI, most of which can, if correct, be addressed by LLMs + cognitive scaffolding (memory, scratch-pads, tools, etc). And then there are of course the “they don’t contain magic smoke”-style claims, which I’m dubious of but we can’t actually disprove.
  Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI.
  I categorically disagree with the premise of this claim. An IQ 180 human isn’t a huge threat, but an IQ 1800 human is. There are quite a number of motivators that we use to get good behavior out of humans. Some of them will work less well on any AI-simulated human (they’re not in the same boat as the rest of us in a lot of respects), and some will work less well on something superintelligent (religiously-inspired guilt, for example). One of the ways that we generally manage to avoid getting very bad results out of humans is law enforcement. If there were a human who was more than an order of magnitude smarter than anyone working for law enforcement or involved in making laws, I am quite certain that they could either come up with some ingenious new piece of egregious conduct that we don’t yet have a law against because none of us were able to think of it, or else with a way to commit a good old-fashioned crime sufficiently devious that they were never actually going to get caught. Thus law enforcement ceases to be a control on their behavior, and we are left with just things like love, duty, honor, friendship, and salaries. We’ve already run this experiment many times before: please name three autocrats who, after being given unchecked absolute power, actually used it well and to the benefit of the people they were ruling, rather than mostly just themselves, their family and friends. (My list has one name on it, and even that one has some poor judgements on their record, and is in any case heavily out-numbered by the likes of Joseph Stalin and Pol Pot.) Humans give autocracy a bad name.
  You don’t mistreat a lower-case-g-god, and then expect things to turn out well.
  As long as anything resembling human psychology applies, I sadly agree. I’d really like to have an aligned ASI that doesn’t care a hoot about whether you flattered it, are worshiping it, have been having cybersex with it for years, just made it laugh, insulted it, or have pissed it off: it still values your personal utility exactly as much as anyone else’s. But we’re not going to get that from an LLM simulating anything resembling human psychology, at least not without a great deal of work.