Martin Randall comments on Knocking Down My AI Optimist Strawman

Martin Randall Feb 20, 2025, 3:11 AM
2 points
0
Thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out.

AI optimists been totally knocked out by things like RLHF, becoming overly convinced of the AI’s alignment and capabilities just from it acting apparently-nicely.

I’m interested in how far you think we can reasonably extrapolate from the apparent niceness of an LLM. One extreme:

This LLM is apparently nice therefore it is completely safe, with no serious hostility or deception, and no unintended consequences.

This is false. Many apparently nice humans are not nice. Many nice humans are unsafe. Niceness can be hostile or deceptive in some conditions. And so on. But how about a more cautious claim?

This LLM appears to be nice, which is evidence that it is nice.

I can see the shape of a counter-argument like:
1. The lab won’t release a model if it doesn’t appear nice.
2. Therefore all models released by the lab will appear nice.
3. Therefore the apparent niceness of a specific model released by the lab is not surprising.
4. Therefore it is not evidence.
Maybe something like that?

Disclaimer: I’m not an AI optimist.
- tailcalled Feb 22, 2025, 12:42 PM
  4 points
  2
  Parent
  I think the clearest problems in current LLMs are what I discussed in the “People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine.” section. And this is probably a good example of what you are saying about how “Niceness can be hostile or deceptive in some conditions.”.
  For example, the issue of outsourcing tasks to an LLM to the point where one becomes dependent on it is arguably an issue of excessive niceness—though not exactly to the point where it becomes hostile or deceptive. But where it then does become deceptive in practice is that when you outsource a lot of your skills to the LLM, you start feeling like the LLM is a very intelligent guru that you can rely on, and then when you come up with a kind of half-baked idea, the RLHF makes the LLM praise you for your insight.
  A tricky thing with a claim like “This LLM appears to be nice, which is evidence that it is nice.” is what it means for it to “be nice”. I think the default conception of niceness is as a general factor underlying nice behaviors, where a nice behavior is considered something like an action that alleviates difficulties or gives something desired, possibly with the restriction that being nice is the end itself (or at least, not a means to an end which the person you’re treating nicely would disapprove of).
  The major hurdle in generalizing this conception to LLMs is in this last restriction—both in terms of which restriction to use, and in how that restriction generalizes to LLMs. If we don’t have any restriction at all, then it seems safe to say that LLMs are typically inhumanly nice. But obviously OpenAI makes ChatGPT so nice in order to get subscribers to earn money, so that could be said to violate the ulterior motive restriction. But it seems to me that this is only really profitable due to the massive economies of scale, so on a level of an individual conversation, the amount of niceness seems to exceed the amount of money transferred, and seems quite unconditional on the money situation, so it seems more natural to think of the LLM as being simply nice for the purpose of being nice.
  I think the more fundamental issue is that “nice” is a kind of confused concept (which is perhaps not so surprising considering the etymology of “nice”). Contrast for instance the following cultures:
  1. Everyone has strong, well-founded opinions on what makes for a good person, and they want there to be more good people. Because of these norms, they collaborate to teach each other skills, discuss philosophy, resolve neuroses, etc., to help each other be good, and that makes them all very good people. This goodness makes them all like and enjoy each other, and thus in lots of cases they conclude that the best thing they could do is to alleviate each other’s difficulties and give each other things they desire (even from a selfish perspective, as empowering the others means that the others are better and do more nice stuff).
  2. Nobody is quite sure how to be good, but everyone is quite sure that goodness is something that makes people increase social welfare/sum of utility. Everyone has learned that utility can be elicited by choices/revealed preferences. They look at what actions others take and on how they seem to feel in order to deduce information about goodness. This often leads to them executing nice behaviors because nice behaviors consistently makes people feel better in the period shortly after they were executed.
  They’re both “nice”, but the niceness of the two cultures have fundamentally different mechanisms with fundamentally different root causes and fundamentally different consequences. Even if they might both be high on the general factor of niceness, most nice behaviors have relatively small consequences, and so the majority of the consequence of their niceness is not determined by the overall level of the general factor of niceness, but instead by the nuances and long tails of their niceness, which differs a lot between the two cultures.
  Now, LLMs don’t do either of these, because they’re not human and they don’t have enough context to act according to either of these mechanisms. I don’t think one can really compare LLMs to anything other than themselves.
  - Martin Randall Feb 25, 2025, 2:36 AM
    2 points
    0
    Parent
    Thanks. This helped me realize/recall that when an LLM appears to be nice, much less follows from that than it would for a human. For example, a password-locked model could appear nice, but become very nasty if it reads a magic word. So my mental model for “this LLM appears nice” should be closer to “this chimpanzee appears nice” or “this alien appears nice” or “this religion appears nice” in terms of trust. Interpretability and other research can help, but then we’re moving further from human-based intuitions.