The most useful definition of “mesa-optimizer” doesn’t require them to perform explicit search, contrary to the current standard.
And presumably, the extent to which search takes place isn't what's important; it isn't a measure of risk, or of optimization. (In other words, it's not a part of the definition, and it shouldn't be a part of the definition.)
Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
expect mesa-search might be a problem?
Highly knowledge-based strategies, such as calculus, which find solutions “directly” with no iteration—but which still involve meaningful computation.
This explains ‘search might not be the only problem’ rather well (even if it isn’t the only alternative).
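(As a toy illustration of that ‘direct but still computational’ point—my own sketch, not anything from the post: both functions below minimize the same quadratic, but only the first iterates/searches.)

```python
# Minimizing f(x) = a*x**2 + b*x two ways (illustrative sketch, not from the post).

def minimize_by_search(a, b, steps=1000, lr=0.01):
    """Gradient descent: an iterative, search-like strategy."""
    x = 0.0
    for _ in range(steps):
        x -= lr * (2 * a * x + b)  # step against the gradient
    return x

def minimize_by_calculus(a, b):
    """Set the derivative 2*a*x + b to zero: no iteration, but still meaningful computation."""
    return -b / (2 * a)

print(minimize_by_search(2.0, 4.0))    # ~ -1.0, approached step by step
print(minimize_by_calculus(2.0, 4.0))  # -1.0 exactly, no search involved
```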
Dumb lookup tables.
Hm. Based on earlier:
Mesa-controller refers to any effective strategies, including mesa-searchers but also “dumber” strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
It sounds like there’s also a risk of smart lookup tables. That might not be the right terminology, but ‘lookup tables which contain really effective things’, even if the tables themselves just execute and don’t change, seems worth pointing out somehow.
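(To make ‘thermostat-like strategies’ and ‘memorized interventions’ concrete, here is a minimal sketch; the conditions, actions, and names are all illustrative, not from the discussion.)

```python
# A mesa-controller that doesn't search: fixed condition -> action rules that
# still steer the environment toward a setpoint.
def thermostat_policy(temperature, setpoint=21.0, band=0.5):
    if temperature < setpoint - band:
        return "heat_on"
    if temperature > setpoint + band:
        return "cool_on"
    return "idle"

# A "smart lookup table": memorized interventions that just execute and never change,
# but which can still be very effective at steering toward some objective.
memorized_interventions = {
    "server_overloaded": "shed_low_priority_jobs",
    "disk_nearly_full": "rotate_logs",
    "intrusion_detected": "isolate_subnet",
}

def lookup_controller(observation):
    return memorized_interventions.get(observation, "do_nothing")
```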
I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn’t been trained on?
AgentOne learns to predict AgentTwo so they don’t run into each other as they navigate their environment and try to pursue their own goals or strategies (jointly or separately).
Something which isn’t a neural network might?
If people don’t want to worry about catastrophic forgetting, they might just freeze the network. (Training phase, thermostat phase.)
Someone copies a trained network, instead of training from scratch—accidentally.
Malware
The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don’t search are truly less concerning, this just means it’s an easier case to guard against. That’s not an argument against including them in the definition of the inner alignment problem.
A controller, mesa- or otherwise, may be a tool another agent creates or employs to obtain their objectives. (For instance, someone might create malware that hacks your thermostat to build a bigger botnet (yay Internet of Things!).) It might be better to think of the ‘intelligence/power/effectiveness of an object for reaching a goal’ (even for a rock) as a function of the system, rather than of the parts.
If you used your chess experience to create a lookup table that could beat me at chess, its ‘intelligence’ would be an expression of your intelligence/optimization.
For non-search strategies, it’s even more important that the goal actually simplify the problem as opposed to merely reiterate it; so there’s even more reason to think that mesa-controllers of this type wouldn’t be aligned with the outer goal.
How does a goal simplify a problem?
My model is that GPT-3 almost certainly is “hiding its intelligence” at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will ‘intentionally’ continue with more spelling mistakes in what it generates.
Yeah, because its goal is prediction. Within prediction there isn’t a right way to write a sentence. It’s not a spelling mistake, it’s a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of ‘seeing through the noise’. You could try going further, and reinforce a particular style, or ‘this word is better than that word’.)
Train a model to predict upvotes on Quora, Stack Exchange, and similar question-answering websites. This serves as a function recognizing “intelligent and helpful responses”.
Uh, that’s not what I’d expect it to do. If you’re worried about deception now, why don’t you think that’d make it worse? (If nothing else, are you trying to create GPT-Flattery?)
If this procedure works exceedingly well, causing GPT to “wake up” and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn’t then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)
It’s not an agent. It’s a predictor. (It doesn’t want to make paperclips.)
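(For concreteness, a minimal sketch of what the quoted upvote-prediction setup might look like. The reward model name is a placeholder for something separately fine-tuned on (question, answer, upvote) data, ‘gpt2’ stands in for GPT-3, and best-of-n sampling stands in for full fine-tuning against the learned reward.)

```python
# Hedged sketch: score sampled answers with a learned "upvote" reward model and
# keep the best one. All model names here are hypothetical placeholders.
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

lm_tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for GPT-3
lm = AutoModelForCausalLM.from_pretrained("gpt2")
rm_tok = AutoTokenizer.from_pretrained("upvote-reward-model")  # hypothetical
reward_model = AutoModelForSequenceClassification.from_pretrained("upvote-reward-model")
# (assumed to output a single 'predicted upvotes' score per (question, answer) pair)

def best_of_n(question: str, n: int = 8) -> str:
    """Sample n candidate answers; return the one the reward model scores highest."""
    inputs = lm_tok(question, return_tensors="pt")
    outputs = lm.generate(**inputs, do_sample=True, num_return_sequences=n,
                          max_new_tokens=128, pad_token_id=lm_tok.eos_token_id)
    candidates = [lm_tok.decode(o, skip_special_tokens=True) for o in outputs]
    scores = []
    for cand in candidates:
        rm_inputs = rm_tok(question, cand, return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(reward_model(**rm_inputs).logits.squeeze().item())
    return candidates[max(range(n), key=lambda i: scores[i])]
```

(Nothing in this setup checks whether a high-scoring answer is true rather than merely upvote-bait, which is exactly the flattery/deception worry raised above.)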
What I intended there was “expect mesa-search to happen at all” (particularly, mesa-search with its own goals).
It sounds like there’s also a risk of smart lookup tables. That might not be the right terminology, but ‘lookup tables which contain really effective things’, even if the tables themselves just execute and don’t change, seems worth pointing out somehow.
Sorry, by “dumb” I didn’t really mean much, except that in some sense lookup tables are “not as smart” as the previous things in the list (not in terms of capabilities, but rather in terms of how much internal processing is going on).
How does a goal simplify a problem?
For example, you can often get better results out of RL methods if you include “shaping” rewards, which reward behaviors which you think will be useful in productive strategies, even though this technically creates misalignment and opportunities for perverse behavior. For example, if you wanted an RL agent to go to a specific square, you might do well to reward movement toward that square.
Similarly, part of the common story about how mesa-optimizers develop is: if they have explicitly represented values, these same kinds of “shaping” values will be adaptive to include, since they guide the search toward useful answers. Without this effect, inner search might not be worthwhile at all, due to inefficiency.
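(A toy sketch of the shaping idea in the gridworld example; the positions, distance measure, and weight are my own assumptions.)

```python
# True objective: reach the goal square. Shaped objective: also pay a little for
# any step that reduces (Manhattan) distance to the goal. This speeds learning,
# but it is exactly the kind of proxy an agent could come to pursue directly,
# instead of the true objective.
def true_reward(pos, goal):
    return 1.0 if pos == goal else 0.0

def shaped_reward(prev_pos, pos, goal, weight=0.1):
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    progress = dist(prev_pos, goal) - dist(pos, goal)  # > 0 when moving toward the goal
    return true_reward(pos, goal) + weight * progress
```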
Yeah, because its goal is prediction. Within prediction there isn’t a right way to write a sentence. It’s not a spelling mistake, it’s a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of ‘seeing through the noise’. You could try going further, and reinforce a particular style, or ‘this word is better than that word’.)
Yes, I agree that GPT’s outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.
However, I feel like your “if you don’t want that, then...” seems to suppose that it’s easy to make it outer-aligned. I don’t think so.
The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations—or similarly, we could just apply a loss function for outputs which aren’t spelled correctly). But what’s the generalization of that?? How do you try to discourage all “deliberate mistakes”?
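(A rough sketch of those two patches for the spelling case, assuming a dictionary-based checker such as the pyspellchecker package is ‘good enough’; the weight and the training-loop line are hypothetical.)

```python
from spellchecker import SpellChecker  # dictionary-based; has a nonzero failure rate

checker = SpellChecker()

def clean_dataset(texts):
    """Option 1: run an automated spellcheck over the corpus before training."""
    cleaned = []
    for text in texts:
        words = text.split()
        cleaned.append(" ".join(checker.correction(w) or w for w in words))
    return cleaned

def spelling_penalty(generated_text, weight=0.1):
    """Option 2: an extra loss term proportional to the fraction of misspelled words."""
    words = generated_text.split()
    if not words:
        return 0.0
    return weight * len(checker.unknown(words)) / len(words)

# total_loss = prediction_loss + spelling_penalty(sampled_output)  # hypothetical training loop
```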
Uh, that’s not what I’d expect it to do. If you’re worried about deception now, why don’t you think that’d make it worse? (If nothing else, are you trying to create GPT-Flattery?)
I don’t think it would be entirely aligned by any means. My prediction is that it’d be incentivized to reveal information (so you could say it’s differentially more “honest” relative to GPT-3 trained only on predictive accuracy). I agree that in the extreme case (if fine-tuned GPT-3 is really good at this) it could end up more deceptive rather than less (due to issues like flattery).
It’s not an agent. It’s a predictor. (It doesn’t want to make paperclips.)
I think you’re anthropomorphizing it.
This was meant to be an extreme case.
Why do you suppose it’s not an agent? Isn’t that essentially the question of inner optimizers? IE, does it get its own goals? Is it just trying to predict?
How do you try to discourage all “deliberate mistakes”?
1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with possibly the limitation that it might not be as good at playing from positions it wouldn’t play itself into)?
*This may be different from ‘maximize score, or wins long term’. If you try to avoid teaching your opponent how to play better while seeking out wins, there can be a ‘meta-game’ approach—though this might require games to have the right structure, especially training structured to create a tournament focus rather than a single-game focus. And I would guess AlphaGo is game-focused, rather than tournament-focused.
Why do you suppose it’s not an agent? Isn’t that essentially the question of inner optimizers? IE, does it get its own goals? Is it just trying to predict?
A fair point. Dealing with this at the level of ‘does it have goals’ is a question worth asking. I think that it, like AlphaGo, isn’t engaging in particularly deliberate action, because I don’t think it is set up to do that, or to learn to do that.
You think of the spelling errors as deception. Another way of characterizing it might be ‘trying to speak the lingo’. For example, we might think of it as an agent that, if it chatted with you for a while and you don’t use words like ‘ain’t’ a lot, might shift to not using words like that around you. (Is an agent that “knows its audience” deceptive? Maybe yes, maybe no.)
You think that there is a correct way to spell words. GPT might be more agnostic. For example (it’s weird to not put this in terms of prediction), if another version of GPT (GPT-Speller) somehow ‘ignored context’, or ‘factored it better’, then we might imagine Speller would spell words right with some probability. You and I understand that ‘words are spelled (mostly) one way’. But Speller might come up with words as these probability distributions over strings—spelling things right most of the time (if the dataset has them spelled that way most of the time), but always getting them wrong sometimes because it:
Thinks that’s how words are. (Probability blobs. Most of the time “should” should be spelled “should”, but 1% or less it should be spelled “shoud”.)
Is very, but not completely, certain it’s got things right. Even with the idea that there is one right way, there might be uncertainty about what that way is. (I think an intentional agent like us, as people, at some point might ask ‘how is this word spelled’, or pay attention to scores it gets, and try to adjust appropriately.**)
**Maybe some new (or existing) methods might be required to fix this? The issue of ‘imperfect feedback’ sounds like something that’s (probably) been an issue before—and not just in conjunction with the word ‘Goodhart’.
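(A tiny illustration of the ‘probability blob’ picture above; the 99%/1% numbers are just the example frequencies from the list.)

```python
import random

# A pure predictor that has learned corpus frequencies reproduces the corpus's
# error rate when sampled: not deception, just prediction.
spelling_dist = {"should": 0.99, "shoud": 0.01}

def sample_spelling(dist):
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

samples = [sample_spelling(spelling_dist) for _ in range(10_000)]
print(samples.count("shoud") / len(samples))  # ~ 0.01: the 'mistake' persists by design
```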
I also lean towards ‘this thing was created, and given something like a goal, and it’s going to keep doing that goal-like thing’. If it ‘spells things wrong to fit in’, that’s because it was trained as a predictor, not a writer. If we want something to write, yeah, figuring out how to train that might be hard. If you want something out of GPT that differs from the objective ‘predict’, then maybe GPT needs to be modified, if prompting it correctly doesn’t work. Given the way it ‘can respond to prompts’, characterizing it as ‘deceptive’ might make sense under some circumstances*, but if you’re going to look at it that way, training something to do ‘prediction’ (of original text) and then having it ‘write’ is systematically going to result in ‘deception’, because it has been trained to be a chameleon. To blend in. To say what whoever wrote the string it is being tested against would say at the moment. Its abilities are shocking and it’s easy to see them in an ‘action framework’. However, if it developed a model of the world, and it was possible to factor that out from the goal—then pulling the model out and getting ‘the truth’ is possible. But the two might not be separable. If trained on, say, a ‘flat earther dataset’, will it say “the earth is round”? Can it actually achieve insight?
If you want a good writer, train a good writer. I’m guessing ‘garbage in, garbage out’ is as much a rule in AI as in straight-up programming.*** If we give something the wrong rewards, the system will be gamed (absent a system (successfully) designed and deployed to prevent that).
*i.e., it might have a mind, but it also might not. Rather it might just be that
***More because the AI has to ‘figure out’ what it is that you want, from scratch.
If GPT, when asked ‘is this spelled correctly: [string]’, tells us truthfully, then deception probably isn’t an issue there. As far as deception goes...arguably it’s ‘deceiving’ everyone, all the time, into thinking it is a human (assuming most text in its corpus is written by humans, and most prompts match that), or trying to. If it thinks it’s supposed to play the part of someone who is bad at spelling, it might be hard to read.
(I haven’t heard of it making any new scientific discoveries*. Though if it hasn’t read a lot of papers, it could be trained...)
*This would be surprising, and might change the way I look at it—if a predictor can do that, what else can it do, and is the distinction between an agent and a predictor a meaningful one? Maybe not. Though pre-registration might be key here. If most of the time it just produces awful or mediocre papers, then maybe it’s just a ‘monkey at a typewriter’.
I’m a bit confused about part of what we’re disagreeing on, so, context trace:
I originally said:
My model is that GPT-3 almost certainly is “hiding its intelligence” at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will ‘intentionally’ continue with more spelling mistakes in what it generates.
Then you said:
Yeah, because its goal is prediction. Within prediction there isn’t a right way to write a sentence. It’s not a spelling mistake, it’s a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of ‘seeing through the noise’. You could try going further, and reinforce a particular style, or ‘this word is better than that word’.)
Then I said:
Yes, I agree that GPT’s outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.
However, I feel like your “if you don’t want that, then...” seems to suppose that it’s easy to make it outer-aligned. I don’t think so.
The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations—or similarly, we could just apply a loss function for outputs which aren’t spelled correctly). But what’s the generalization of that?? How do you try to discourage all “deliberate mistakes”?
Then you said:
1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with possibly the limitation that it might not be as good at playing from positions it wouldn’t play itself into)?
It seems like the discussion was originally about hidden information, not deliberate mistakes—deliberate mistakes were just an example of GPT taking information-hiding actions. I spuriously asked how to avoid all deliberate mistakes when what I intended had more to do with hidden information.
The claim I was trying to support in that paragraph was (as stated in the directly preceding paragraph) that it isn’t easy to make it outer-aligned. AlphaGo isn’t outer-aligned.
AlphaGo could be hiding a lot of information, like GPT. In AlphaGo’s case, the hidden information would include a lot of concepts about the state of the game which aren’t easily conveyed to human users. This isn’t particularly sinister, but it is hidden information.
A hypothetical more-data-efficient AlphaGo which was trained only on playing humans (rather than self-play) could have an internal psychological model of humans. This would be “inaccessible information”. It could also implement deliberate deception to increase its win rate.
I get the vibe that I might be missing a broader point you’re trying to make. Maybe something like “you get what you ask for”—you’re pointing out that hiding information like this isn’t at all surprising given the loss function, and different loss functions imply different behavior, often in a straightforward way.
If this were your point, I would respond:
The point of the inner alignment problem is that, it seems, you don’t always get what you ask for.
I’m not trying to say it’s surprising that GPT would hide things in this way. Rather, this is a way of thinking about how GPT thinks and how sophisticated/coherent its internal world-model is (in contrast to what we can see by asking it questions). This seems like important, but indirect, information about inner optimizers.
You think of the spelling errors as deception. Another way of characterizing it might be ‘trying to speak the lingo’. For example, we might think of it as an agent that, if it chatted with you for a while and you don’t use words like ‘ain’t’ a lot, might shift to not using words like that around you. (Is an agent that “knows its audience” deceptive? Maybe yes, maybe no.)
You think that there is a correct way to spell words. GPT might be more agnostic.
I’m not sure whether there is any disagreement here. Certainly I tend to think about language differently from that. But I agree that’s the purely descriptive view.
I also lean towards ‘this thing was created, and given something like a goal, and it’s going to keep doing that goal-like thing’. If it ‘spells things wrong to fit in’, that’s because it was trained as a predictor, not a writer.
I mean, I agree as a statistical tendency, but are you assuming away the inner alignment problem?
Given the way it ‘can respond to prompts’, characterizing it as ‘deceptive’ might make sense under some circumstances*, but if you’re going to look at it that way, training something to do ‘prediction’ (of original text) and then having it ‘write’ is systematically going to result in ‘deception’, because it has been trained to be a chameleon. To blend in.
We seem to be in agreement about this.
However, if it developed a model of the world, and it was possible to factor that out from the goal—then pulling the model out and getting ‘the truth’ is possible. But the two might not be separable. If trained on, say, a ‘flat earther dataset’, will it say “the earth is round”? Can it actually achieve insight?
Right, this is the question I am interested in. Is there a world model? (To what degree?)
You’re right. We don’t seem to (largely) disagree about what is going on.
1. Intentional deception
My original reading of ‘deception’ and ‘information hiding’ (your words) was around (my distinction:) ‘is this an intelligent move?’ If a chameleon is on something green and turns green to blend in, that seems different from humans or other agents conducting deception.
If GPT is annoying because the way it’s trained makes it hard to get it to do something, that’s one thing. But lying sounds like something different.
(As I’ve thought more about this, AI looks like a place where distinctions start to break down a little. Figuring out what’s going on isn’t necessarily going to be easy, though that seems like a different issue from ‘misunderstanding terminology’.)
2. Minor point
I mean, I agree as a statistical tendency, but are you assuming away the inner alignment problem?
Hm. I was assuming the issue was ‘an outer alignment problem’, rather than ‘an inner alignment problem’.
(With the caveat that ‘a different purpose (writer) calls for a different method (than prediction), or more than just prediction’ is less ‘human value is complicated’ and more ‘writing is not prediction’*.
*GPT illustrates the shocking degree to which this isn’t true. (Though maybe plagiarism is (still) an issue.)
I wonder what an inner alignment problem with GPT would look like. Not memorizing the dataset? Not engaging in ‘deceptive behavior’?)
But again, not much disagreement here.
EDIT note:
AI is less ‘garbage in, garbage out’, and more ‘I know there’s something bright in there, how do I get it out’?