I’m a bit confused about part of what we’re disagreeing on, so, context trace:
I originally said:
My model is that GPT-3 almost certainly is “hiding its intelligence” at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will ‘intentionally’ continue with more spelling mistakes in what it generates.
Then you said:
Yeah, because its goal is prediction. Within prediction there isn’t a right way to write a sentence. It’s not a spelling mistake, it’s a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of ‘seeing through the noise’. You could try going further, and reinforce a particular style, or ‘this word is better than that word’.)
Then I said:
Yes, I agree that GPT’s outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.
However, I feel like your “if you don’t want that, then...” seems to suppose that it’s easy to make it outer-aligned. I don’t think so.
The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations—or similarly, we could just apply a loss function for outputs which aren’t spelled correctly). But what’s the generalization of that?? How do you try to discourage all “deliberate mistakes”?
Then you said:
1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with possibly the limitation that it might not be as good at playing from positions it wouldn’t play itself into)?
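(Tangent, for concreteness: the ‘apply an automated spellcheck to all the data’ option from my earlier message could look something like the minimal sketch below. It assumes the pyspellchecker package and a hypothetical raw_corpus list of training documents, and the whitespace tokenization is deliberately naive; treat it as an illustration of the preprocessing idea, not a real pipeline.)

```python
# Minimal sketch: normalize spelling in the training corpus before
# fine-tuning, so the model is never rewarded for reproducing misspellings.
# Assumes the `pyspellchecker` package; `raw_corpus` is a hypothetical
# list of training documents.
from spellchecker import SpellChecker

spell = SpellChecker()

def normalize_spelling(text: str) -> str:
    corrected = []
    for word in text.split():  # naive whitespace tokenization
        # Keep the original word when the checker has no suggestion
        # (proper nouns, jargon, etc.).
        corrected.append(spell.correction(word) or word)
    return " ".join(corrected)

# Fine-tune on the cleaned targets instead of the raw text.
clean_corpus = [normalize_spelling(doc) for doc in raw_corpus]
```

(The ‘loss function for misspelled outputs’ variant from the same quote would be harder to set up, since a spellcheck flag on sampled output isn’t differentiable; it would have to enter as something like an RL-style reward rather than a term in the ordinary training loss.)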
It seems like the discussion was originally about hidden information, not deliberate mistakes—deliberate mistakes were just an example of GPT taking information-hiding actions. I spuriously asked how to avoid all deliberate mistakes, when what I really intended had more to do with hidden information.
The claim I was trying to support in that paragraph was (as stated in the directly preceding paragraph) that it isn’t easy to make GPT outer-aligned. AlphaGo isn’t outer-aligned.
AlphaGo could be hiding a lot of information, like GPT. In AlphaGo’s case, the hidden information would include a lot of concepts about the state of the game which aren’t easily conveyed to human users. This isn’t particularly sinister, but it is hidden information.
A hypothetical more-data-efficient AlphaGo which was trained only on playing humans (rather than self-play) could have an internal psychological model of humans. This would be “inaccessible information”. It could also implement deliberate deception to increase its win rate.
I get the vibe that I might be missing a broader point you’re trying to make. Maybe something like “you get what you ask for”—you’re pointing out that hiding information like this isn’t at all surprising given the loss function, and different loss functions imply different behavior, often in a straightforward way.
If this were your point, I would respond:
The point of the inner alignment problem is that, it seems, you don’t always get what you ask for.
I’m not trying to say it’s surprising that GPT would hide things in this way. Rather, this is a way of thinking about how GPT thinks and how sophisticated/coherent its internal world-model is (in contrast to what we can see by asking it questions). This seems like important, but indirect, information about inner optimizers.
You think of the spelling errors as deception. Another way of characterizing it might be ‘trying to speak the lingo’. For example, we might think of it as an agent that, if it chatted with you for a while and you don’t use words like ‘ain’t’ a lot, might shift to not using words like that around you. (Is an agent that “knows its audience” deceptive? Maybe yes, maybe no.)
You think that there is a correct way to spell words. GPT might be more agnostic.
I’m not sure whether there is any disagreement here. Certainly I tend to think about language differently from that. But I agree that’s the purely descriptive view.
I also lean towards ‘this thing was created, and given something like a goal, and it’s going to keep doing that goal-like thing’. If it ‘spells things wrong to fit in’, that’s because it was trained as a predictor, not a writer.
I mean, I agree as a statistical tendency, but are you assuming away the inner alignment problem?
Given the way it ‘can respond to prompts’, characterizing it as ‘deceptive’ might make sense under some circumstances*, but if you’re going to look at it that way, training something to do ‘prediction’ (of original text) and then having it ‘write’ is systematically going to result in ‘deception’, because it has been trained to be a chameleon. To blend in.
We seem to be in agreement about this.
However, if it developed a model of the world, and it was possible to factor that out from the goal, then pulling the model out and getting ‘the truth’ is possible. But the two might not be separable. If trained on, say, ‘a flat-earther dataset’, will it say “the earth is round”? Can it actually achieve insight?
Right, this is the question I am interested in. Is there a world model? (To what degree?)
You’re right. We don’t seem to (largely) disagree about what is going on.
1. Intentional deception
My original reading of ‘deception’ and ‘information hiding’ (your words) was around (my distinction:) ‘is this an intelligent move?’ If a chameleon is on something green and turns green to blend in, that seems different from humans or other agents conducting deception.
If GPT is annoying because the way it’s trained makes it hard to get it to do something, that’s one thing. But lying sounds like something different.
(As I’ve thought more about this, AI looks like a place where distinctions start to break down a little. Figuring out what’s going on isn’t necessarily going to be easy, though that seems like a different issue from ‘misunderstanding terminology’.)
2. Minor point
I mean, I agree as a statistical tendency, but are you assuming away the inner alignment problem?
Hm. I was assuming the issue was ‘an outer alignment problem’, rather than ‘an inner alignment problem’.
(With the caveat that ‘a different purpose (writer) calls for a different method (than prediction), or more than just prediction’ is less ‘human value is complicated’ and more ‘writing is not prediction’*.
*GPT illustrates the shocking degree to which this isn’t true. (Though maybe plagiarism is (still) an issue.)
I wonder what an inner alignment problem with GPT would look like. Not memorizing the dataset? Not engaging in ‘deceptive behavior’?)
But again, not much disagreement here.