Kaj_Sotala comments on Thoughts on “AI is easy to control” by Pope & Belrose

Kaj_Sotala 2 Dec 2023 13:36 UTC
LW: 7 AF: 5
I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven’t yet read the page you’re responding to.)
Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.
Maybe? Depends on what exactly you mean by the word “might”, but it doesn’t seem obvious to me that this would need to be the case. My intuition, from seeing the kinds of interpretability results we’ve seen so far, is that within less than a decade we’d already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don’t keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.
If by “might” you mean something like “there’s at least a 10% probability that this could take decades to answer”, then sure, I’d agree with that. Now I haven’t actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn’t seem immediately obvious to me that I should assign “it would take decades of work to answer this” a very high probability.
Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?
I would assume the intuition to be something like “if they’re simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex”.
We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.
There’s something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus… it’s not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.
Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that’s specifically the kind of thing that I’d expect them to work on trying to find out!
And even if there was a major shift, I think it’s basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn’t change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside from some increase in trust toward other people who I felt “get it”). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you’ve made about an adult will tend to stay valid, though of course with children and younger people there’s much greater change.
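As a toy illustration of the disagreement above, here is a minimal sketch contrasting experiments on a model that is reset to a checkpoint with experiments on an online learner whose weights change with experience. Everything here is an assumption made for illustration: `TinyModel`, `run_experiment`, the one-weight linear model, and the learning rate are hypothetical and are not meant to resemble any real LLM or brain-like AGI.

```python
import copy


class TinyModel:
    """Hypothetical one-weight linear model: output = weight * x."""

    def __init__(self, weight: float = 0.5):
        self.weight = weight

    def respond(self, x: float) -> float:
        return self.weight * x

    def learn(self, x: float, target: float, lr: float = 0.1) -> None:
        """One step of online gradient descent on squared error."""
        error = self.respond(x) - target
        self.weight -= lr * error * x


def run_experiment(model: TinyModel, probe: float) -> float:
    """Measure the model's response to a fixed probe input."""
    return model.respond(probe)


checkpoint = TinyModel(weight=0.5)

# Reset-to-checkpoint setting: every trial starts from identical weights,
# so repeating the same probe gives the same measurement.
frozen_results = [run_experiment(copy.deepcopy(checkpoint), 2.0) for _ in range(3)]

# Online-learning setting: the model updates its weights between trials.
online = copy.deepcopy(checkpoint)
before = run_experiment(online, 2.0)

# A mildly informative experience ("reading the morning newspaper").
online.learn(x=1.0, target=0.6)
after_small = run_experiment(online, 2.0)

# A very surprising experience produces a much larger update.
online.learn(x=1.0, target=5.0)
after_big = run_experiment(online, 2.0)

print("frozen trials:", frozen_results)
print("drift after small update:", abs(after_small - before))
print("drift after large update:", abs(after_big - before))
```

In this sketch the earlier measurement stops being exactly valid after any update, which matches the “open to question” point, while the size of the drift depends on how surprising the experience was, which matches the point that most predictions about the learner stay approximately right. How far either observation carries over to real systems is not something a toy like this can settle.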
Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.
In some sense yes, but it does also seem to me that prediction and desire do get conflated in humans in various ways, and that it would be misleading to say that the people in question want it. For example, I think about this post by @romeostevensit often:
Fascinating concept that I came across in military/police psychology dealing with the unique challenges people face in situations of extreme stress/danger: scenario completion. Take the normal pattern completion that people do and put fear blinders on them so they only perceive one possible outcome and they mechanically go through the motions *even when the outcome is terrible* and there were obvious alternatives. This leads to things like officers shooting *after* a suspect has already surrendered, having overly focused on the possibility of needing to shoot them. It seems similar to target fixation where people under duress will steer a vehicle directly into an obstacle that they are clearly perceiving (looking directly at) and can’t seem to tear their gaze away from. Or like a self fulfilling prophecy where the details of the imagined bad scenario are so overwhelming, with so little mental space for anything else that the person behaves in accordance with that mental picture even though it is clearly the mental picture of the *un*desired outcome.
I often try to share the related concept of stress induced myopia. I think that even people not in life or death situations can get shades of this sort of blindness to alternatives. It is unsurprising when people make sleep a priority and take internet/screen fasts that they suddenly see that the things they were regarding as obviously necessary are optional. In discussion of trauma with people this often seems to be an element of relationships sadly enough. They perceive no alternative and so they resign themselves to slogging it out for a lifetime with a person they are very unexcited about. This is horrific for both people involved.
It’s, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me that a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings like the ones you can get in therapy can be so powerful (“I wasn’t lazy after all, I just didn’t have the right tools for being productive” can drastically reorient many predictions you’re making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)
While it’s obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that “the human brain is not just LLM-like prediction”, while you seem to be saying that “the human brain does not do LLM-like prediction at all”. (Of course, “LLM-like prediction” is a vague concept and maybe we’re just using it differently and ultimately agree.)
- Steven Byrnes 2 Dec 2023 14:51 UTC
  LW: 16 AF: 7
  While it’s obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that “the human brain is not just LLM-like prediction”, while you seem to be saying that “the human brain does not do LLM-like prediction at all”. (Of course, “LLM-like prediction” is a vague concept and maybe we’re just using it differently and ultimately agree.)
  I disagree with whether that distinction matters:
  I think technical discussions of AI safety depend on the AI-algorithm-as-a-whole; I think “does the algorithm have such-and-such component” is not that helpful a question.
  So for example, here’s a nightmare-scenario that I think about often:
  - (step 1) Someone reads a bunch of discussions about LLM x-risk
  - (step 2) They come down on the side of “LLM x-risk is low”, and therefore (they think) it would be great if TAI is an LLM as opposed to some other type of AI
  - (step 3) So then they think to themselves: Gee, how do we make LLMs more powerful? Aha, they find a clever way to build an AI that combines LLMs with open-ended real-world online reinforcement learning or whatever.
  Even if (step 2) is OK (which I don’t want to argue about here), I am very opposed to (step 3), particularly the omission of the essential part where they should have said “Hey wait a minute, I had reasons for thinking that LLM x-risk is low, but do those reasons apply to this AI, which is not an LLM of the sort that I’m used to, but rather it’s a combination of LLM + open-ended real-world online reinforcement learning or whatever?” I want that person to step back and take a fresh look at every aspect of their preexisting beliefs about AI safety / control / alignment from the ground up, as soon as any aspect of the AI architecture and training approach changes, even if there’s still an LLM involved. :)
- Steven Byrnes 3 Dec 2023 22:45 UTC
  LW: 15 AF: 9
  There’s something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus… it’s not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.
  I dunno, I wrote “invalid (or at least, open to question)”. I don’t think that’s too strong. Like, just because it’s “open to question”, doesn’t mean that, upon questioning it, we won’t decide it’s fine. I.e., it’s not that the conclusion is necessarily wrong, it’s that the original argument for it is flawed.
  Of course I agree that the morning paper thing would probably be fine for humans, unless the paper somehow triggered an existential crisis, or I tried a highly addictive substance while reading it, etc. :)
  Some relevant context is: I don’t think it’s realistic to assume that, in the future, AI models will be only slightly fine-tuned in a deployment-specific way. I think the relevant comparison is more like “can your values change over the course of years”, not “can your values change after reading the morning paper?”
  Why do I think that? Well, let’s imagine a world where you could instantly clone an adult human. One might naively think that there would be no more on-the-job learning ever. Instead (one might think), if you want a person to help with chemical manufacture, you open the catalog to find a human who already knows chemical manufacturing, and order a clone of them; and if you want a person to design widgets, you go to a different catalog page, and order a clone of a human widget design expert; and so on.
  But I think that’s wrong.
  I claim there would be lots of demand to clone a generalist—a person who is generally smart and conscientious and can get things done, but not specifically an expert in metallurgy or whatever the domain is. And then, this generalist would be tasked with figuring out whatever domains and skills they didn’t already have.
  Why do I think that? Because there are just too many possible specialties, and especially combinations of specialties, for a pre-screened clone-able human to already exist in each of them. Like, think about startup founders. They’re learning how to do dozens of things. Why don’t they outsource their office supply questions to an office supply expert, and their hiring questions to a hiring expert, etc.? Well, they do to some extent, but there are coordination costs, and more importantly the experts would lack all the context necessary to understand what the ultimate goals are. What are the chances that there’s a pre-screened clone-able human that knows about the specific combination of things that a particular application needs (rural Florida zoning laws AND anti-lock brakes AND hurricane preparedness AND …)?
  So instead I expect that future AIs will eventually do massive amounts of figuring-things-out in a nearly infinite variety of domains, and moreover that the figuring out will never end. (Just as the startup founder never stops needing to learn new things, in order to succeed.) So I don’t like plans where the AI is tested in a standardized way, and then it’s assumed that it won’t change much in whatever one of infinitely many real-world deployment niches it winds up in.