Let’s assume a base model (i.e. not RLHF’d), since you asserted a way to turn the LM into a goal-driven chatbot via prompt engineering alone. So you put in some prompt, and somewhere in the middle of that prompt is a part which says “Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do”, for some X.
The basic problem is that this hypothetical language model will not, in fact, do what X, having considered this carefully for a while, would have wanted it to do. What it will do is output text which statistically looks like it would come after that prompt, if the prompt appeared somewhere on the internet.
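To make “output text which statistically looks like it would come after that prompt” concrete, here is a minimal sketch of what prompting a base model actually does. Everything in it is illustrative rather than taken from the discussion: it assumes the Hugging Face transformers library and uses GPT-2 as a small stand-in for the hypothetical base LM.

```python
# Minimal sketch: a base model given a goal-stating prompt does not "obey" it;
# it samples a statistically plausible continuation of the text.
# Assumptions (not from the discussion above): Hugging Face `transformers`,
# GPT-2 as a small stand-in for the hypothetical base LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The instruction is just more conditioning text for the next-token distribution.
prompt = (
    "Do what (pre-ASI) X, having considered this carefully for a while, "
    "would have wanted you to do.\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Nothing in that pipeline treats the instruction as a goal to be pursued; the model is just answering “what text tends to follow text like this?”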
The Waluigi effect seems relevant here. From the perspective of Simulator Theory, the prompt is meant to summon a careful simulacrum that follows the instruction to a T, but in reality, this works only if “on the actual internet characters described with that particular [prompt] are more likely to reply with correct answers.”
Things can get even weirder, and the model can collapse into the complete antithesis of the nice, friendly, aligned persona:
Rules normally exist in contexts in which they are broken.
When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.
There’s a common trope in plots of protagonist vs antagonist.
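To put rough, made-up numbers on the “few extra bits” point above: suppose the prompt spends about 100 bits of description pinning down the nice, friendly, aligned persona, and flagging “this character is secretly the opposite” costs only about 5 more bits. Then, under a crude description-length view, conditioning on the prompt leaves the antipode roughly 2^(-5) ≈ 3% as likely as the intended persona: suppressed by 5 bits, not by the 100 bits of careful prompt engineering you might have hoped were protecting you. The numbers are purely illustrative; the point is only the shape of the argument.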
Technically true. But you could similarly argue that humans are just clumps of molecules following physical laws. Talking about human goals is a charitable interpretation.
And if you are in a charitable mood, you could interpret LMs as absorbing the explicit and tacit knowledge of millions of Internet authors. A superior ML algorithm would just be doing this better (and maybe it wouldn’t need lower-quality data).
That is not how this works. Let’s walk through it for both the “humans as clumps of molecules following physics” framing and the “LLM as next-text-on-internet predictor” framing.
Humans as clumps of molecules following physics
Picture a human attempting to achieve some goal—for concreteness, let’s say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: they maybe get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket.
Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever.
LLM as next-text-on-internet predictor
Now imagine finding the text “Notes From a Prompt Factory” on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today.
The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it’s fiction about a fictional prompt factory. So that’s the sort of thing we should expect a highly capable LLM to output following the prompt “Notes From a Prompt Factory”: fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction.
It’s not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I’m perfectly happy to assume that it is. The issue is that the distribution of text on today’s internet which follows the prompt “Notes From a Prompt Factory” is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge from internet authors, and then uses that knowledge to correctly predict that what follows the text “Notes From a Prompt Factory” will not be actual notes from an actual prompt factory.
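One way to see what “uses that knowledge to correctly predict” cashes out to is to compare the likelihood the model assigns to different candidate continuations. The sketch below is purely illustrative: it assumes the Hugging Face transformers library, uses GPT-2 as a (much weaker) stand-in for the highly capable LLM, and both candidate continuations are invented for the example.

```python
# Sketch: which continuation of "Notes From a Prompt Factory" does the model
# consider likelier, fiction-flavored prose or literal operational notes?
# Assumptions (not from the discussion above): Hugging Face `transformers`,
# GPT-2 as a stand-in model, made-up candidate continuations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + continuation` (true for typical word-boundary splits).
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Score each continuation token under the distribution predicted from its prefix.
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = "Notes From a Prompt Factory\n"
fiction = "The factory floor hummed as the narrator began her shift among the prompt looms."
literal = "Shift log, 06:00: calibrated prompt templates 14 through 22; throughput nominal."

print("fiction-style continuation:", continuation_logprob(prompt, fiction))
print("literal-notes continuation:", continuation_logprob(prompt, literal))
```

A model that has really absorbed the statistics of today’s internet should put more probability on the fiction-flavored continuation, because that is what actually follows a title like that online; the particular numbers GPT-2 produces don’t matter, only the shape of the comparison.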
“Some content on the Internet is fabricated, and therefore we can never trust LMs trained on it”
Is this a fair summary?
No, because we have tons of information about which specific kinds of information on the internet are/aren’t usually fabricated. It’s not like we have no idea at all which internet content is more/less likely to be fabricated.
Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It’s standard basic material, there are lots of presentations of it, and it’s not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that’s not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.
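Concretely, the low-temperature use-case looks like the earlier sampling sketch with the temperature turned down (or greedy decoding), which keeps the samples in the bulk of the distribution, and for the infinitude of primes the bulk is dominated by correct textbook presentations. Same illustrative assumptions as before (Hugging Face transformers, GPT-2 as a stand-in; GPT-2 itself is far too weak to reliably produce the proof, the snippet just shows the mechanism):

```python
# Sketch of low-temperature sampling for standard, rarely-fabricated material.
# Assumptions as before: Hugging Face `transformers`, GPT-2 as a stand-in.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Theorem. There are infinitely many prime numbers.\nProof."
inputs = tokenizer(prompt, return_tensors="pt")
# Low temperature concentrates sampling on high-probability (i.e. typical,
# textbook-like) continuations rather than the rare trolling/mistaken tail.
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```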
On the other end of the spectrum, “fusion power plant blueprints” on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.
I would add “and the kind of content you want to get from aligned AGI definitely is fabricated on the Internet today”. So the powerful LM trying to predict it will predict what the fabrication would look like.