[apologies on slowness—I got distracted] Granted on type hierarchy. However, I don’t think all instances of GPT need to look like they inherit from the same superclass. Perhaps there’s such a superclass, but we shouldn’t assume it.
I think most of my worry comes down to potential reasoning along the lines of:
1. GPT is a simulator;
2. Simulators have property p;
3. Therefore GPT has property p.
Whereas what I think is justified is:
1. GPT instances are usually usefully thought of as simulators;
2. Simulators have property p;
3. We should suspect that a given instance of GPT will have property p, and confirm/falsify this.
I don’t claim you’re advocating the former: I’m claiming that people are likely to use the former if “GPT is a simulator” is something they believe. (this is what I mean by motte-and-baileying into trouble)
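As a minimal sketch of the gap between the two argument forms (in Lean, with hypothetical predicates GPT, Simulator, and P standing in for the informal terms): the strong form is only as good as its universal premise, which is exactly the premise I don't think we should grant; the hedged form yields no theorem at all, only a hypothesis to check per instance.

```lean
-- Minimal sketch; GPT, Simulator, and P are hypothetical predicates over
-- instances, standing in for the informal terms above.
variable {Inst : Type} (GPT Simulator P : Inst → Prop)

-- Strong form: granted the universal premise "every GPT instance is a
-- simulator", the property transfers as a theorem.
example (h1 : ∀ x, GPT x → Simulator x)
        (h2 : ∀ x, Simulator x → P x) :
    ∀ x, GPT x → P x :=
  fun x hx => h2 x (h1 x hx)

-- Hedged form: "usually usefully thought of as a simulator" supplies no
-- universal premise, so nothing follows deductively; P x remains a
-- per-instance hypothesis to confirm or falsify.
```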
If you don’t mean to imply anything mechanistic by “simulator”, then I may have misunderstood you—but at that point “GPT is a simulator” doesn’t seem to get us very far.
If it’s deceptively aligned, it’s not a simulator in an important sense because its behavior is not sufficient to characterize very important aspects of its nature (and its behavior may be expected to diverge from simulation in the future).
It’s true that the distinction between inner misalignment and robustness/generalization failures, and thus the distinction between flawed/biased/misgeneralizing simulators and pretend-simulators, is unclear, and seems like an important thing to become less confused about.
I think this is the fundamental issue. Deceptive alignment aside, what else qualifies as “an important aspect of its nature”? Which aspects disqualify a model as a simulator? Which aspects count as inner misalignment?
To be clear on [x is a simulator (up to inner misalignment)], I need to know:
1. What is implied mechanistically (if anything) by “x is a simulator”.
2. What is ruled out by “(up to inner misalignment)”.
I’d be wary of assuming there’s any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don’t mean to imply this?) I’m all for deconfusion, but it’s possible there’s no joint at which to carve here.
(my guess would be that we’re sometimes confused by the hidden assumption: [a priori unlikely systematically misleading situation ⇒ intent to mislead] whereas we should be thinking more like [a priori unlikely systematically misleading situation ⇒ selection pressure towards things that mislead us]
I.e. looking for deception in something that systematically misleads us is like looking for the generator for beauty in something beautiful. Beauty and [systematic misleading] are relations between ourselves and the object. Selection pressure towards this relation may or may not originate in the object.)
Can you give an example of what it would mean for a GPT not to be a simulator, or to not be a simulator in some sense?
Here I meant to point to the lack of clarity around what counts as inner misalignment, and what GPT’s being a simulator would imply mechanistically (if anything).