Great post. Very interesting.
However, I think that assuming there’s a “true name” or “abstract type that GPT represents” is an error.
If GPT means “transformers trained on next-token prediction”, then GPT’s true name is just that. The character of the models produced by that training is another question—an empirical one. That character needn’t be consistent (even once we exclude inner alignment failures).
Even if every GPT is a simulator in some sense, I think there’s a risk of motte-and-baileying our way into trouble.
Things are instances of more than one true name because types are hierarchical.
GPT is a thing. GPT is an AI (a type of thing). GPT is also an ML model (a type of AI). GPT is also a simulator (a type of ML model). GPT is a generative pretrained transformer (a type of simulator). GPT-3 is a generative pretrained transformer with 175B parameters trained on a particular dataset (a type/instance of GPT).
The intention is not to rename GPT → simulator. Things that are not GPT can be simulators too. “Simulator” is a superclass of GPT.
I propose “simulator” as a named category because I think it’s useful to talk about properties of simulators more generally, just as it makes sense to be able to speak of “AI alignment” and not only “GPT alignment”. We can say things like “simulators generate trajectories that evolve according to the learned conditional probabilities of the training distribution” instead of “GPTs, RNNs, LSTMs, DALL-E, n-grams, and RL transition models generate trajectories that evolve according to the learned conditional probabilities of the training distribution”. The former statement also accounts for hypothetical architectures. Carving reality at its joints is not just about classifying things into the right buckets, but about having buckets whose boundaries are optimized so that we can efficiently condition on names to communicate useful information.
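As a rough illustration of “simulator” as a superclass, here is a minimal Python sketch in which a toy bigram model (one of the n-gram models mentioned above) satisfies a shared interface. The names `Simulator`, `BigramSimulator`, `next_distribution`, and `rollout` are hypothetical, chosen only for this example, and a behavioral interface like this says nothing about how any real model works internally.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Protocol, Sequence


class Simulator(Protocol):
    """Hypothetical interface: maps a context to a next-token distribution."""

    def next_distribution(self, context: Sequence[str]) -> Dict[str, float]: ...


class BigramSimulator:
    """Toy n-gram (n=2) model: conditional probabilities are normalized counts."""

    def __init__(self, corpus: Sequence[str]) -> None:
        counts: Dict[str, Counter] = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            counts[prev][nxt] += 1
        self.probs: Dict[str, Dict[str, float]] = {
            prev: {tok: n / sum(c.values()) for tok, n in c.items()}
            for prev, c in counts.items()
        }

    def next_distribution(self, context: Sequence[str]) -> Dict[str, float]:
        # Condition only on the last token; unseen contexts give an empty dict.
        return self.probs.get(context[-1], {}) if context else {}


def rollout(sim: Simulator, prompt: Sequence[str], steps: int,
            rng: random.Random) -> List[str]:
    """Sample a trajectory by repeatedly drawing from the learned conditionals."""
    trajectory = list(prompt)
    for _ in range(steps):
        dist = sim.next_distribution(trajectory)
        if not dist:  # no learned continuation for this context
            break
        tokens, weights = zip(*dist.items())
        trajectory.append(rng.choices(tokens, weights=weights, k=1)[0])
    return trajectory


corpus = "the cat sat on the mat".split()
sim = BigramSimulator(corpus)
print(rollout(sim, ["cat"], steps=2, rng=random.Random(0)))
```

The `rollout` function is written against the interface alone, so it would work unchanged for any other model that exposes `next_distribution`. That is the sense in which a statement about “simulators” can cover several architectures at once.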
For the same reasons stated above, I think the fact that “simulator” doesn’t constrain the details of internal implementation is a feature, not a bug.
On one hand there is the simulation outer objective, which describes some training setups exactly. On the other, there is the question of whether a particular model should be characterized as a simulator.
To the extent that a model minimizes loss on the outer objective, it approaches being a simulator (behaviorally). Different architectures will be imperfect simulators in different ways, and generalize differently OOD. If it’s deceptively aligned, it’s not a simulator in an important sense because its behavior is not sufficient to characterize very important aspects of its nature (and its behavior may be expected to diverge from simulation in the future).
It’s true that the distinction between inner misalignment and robustness/generalization failures, and thus the distinction between flawed/biased/misgeneralizing simulators and pretend-simulators, is unclear, and seems like an important thing to become less confused about.
Can you give an example of what it would mean for a GPT not to be a simulator, or to not be a simulator in some sense?
[apologies on slowness—I got distracted]
Granted on type hierarchy. However, I don’t think all instances of GPT need to look like they inherit from the same superclass. Perhaps there’s such a superclass, but we shouldn’t assume it.
I think most of my worry comes down to potential reasoning along the lines of:
GPT is a simulator;
Simulators have property p;
Therefore GPT has property p;
When what I think is justified is:
GPT instances are usually usefully thought of as simulators;
Simulators have property p;
We should suspect that a given instance of GPT will have property p, and confirm/falsify this;
I don’t claim you’re advocating the former: I’m claiming that people are likely to use the former if “GPT is a simulator” is something they believe. (this is what I mean by motte-and-baileying into trouble)
If you don’t mean to imply anything mechanistic by “simulator”, then I may have misunderstood you—but at that point “GPT is a simulator” doesn’t seem to get us very far.
I think this is the fundamental issue.
Deceptive alignment aside, what else qualifies as “an important aspect of its nature”?
Which aspects disqualify a model as a simulator?
Which aspects count as inner misalignment?
To be clear on [x is a simulator (up to inner misalignment)], I need to know:
What is implied mechanistically (if anything) by “x is a simulator”.
What is ruled out by “(up to inner misalignment)”.
I’d be wary of assuming there’s any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don’t mean to imply this?)
I’m all for deconfusion, but it’s possible there’s no joint at which to carve here.
(my guess would be that we’re sometimes confused by the hidden assumption:
[a priori unlikely systematically misleading situation ⇒ intent to mislead]
whereas we should be thinking more like
[a priori unlikely systematically misleading situation ⇒ selection pressure towards things that mislead us]
I.e. looking for deception in something that systematically misleads us is like looking for the generator for beauty in something beautiful. Beauty and [systematic misleading] are relations between ourselves and the object. Selection pressure towards this relation may or may not originate in the object.)
By “a simulator in some sense”, I meant to point to the lack of clarity around what counts as inner misalignment, and what GPT’s being a simulator would imply mechanistically (if anything).
Also see this comment thread for discussion of true names and the inadequacy of “simulator”.