we will select for “explicit search processes with simple objectives”
The actual argument is that short descriptions get higher parameter-space volume, and so the things we find are those with short descriptions (low Kolmogorov complexity). The thing with a short description is the whole mesa-optimizer, not just its goal. This is misleading for goals because low Kolmogorov complexity doesn’t mean low “complexity” in many other senses, so an arbitrary goal with low Kolmogorov complexity would actually be much more “complicated” than the intended base objective. In particular, it probably cares about the real world outside the episode and is thus motivated to exhibit deceptively aligned behavior.
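One way to make the volume claim explicit (my own formalization of the counting intuition, in the spirit of algorithmic-probability bounds and the parameter-function-map simplicity-bias results, not something spelled out in this thread):

```latex
% Hedged sketch: behavior(theta) is the (possibly coarse-grained) behavior a
% parameter vector theta induces, Theta is the parameter space, K(f) is the
% length of the shortest description of behavior f. The inequality is the
% standard algorithmic-probability bound; the argument additionally needs it
% to be roughly tight for the behaviors that matter, which is the empirical part.
\[
  \Pr_{\theta \sim \mathrm{init}}\big[\,\mathrm{behavior}(\theta) = f\,\big]
  \;=\; \frac{\mathrm{vol}\{\theta \in \Theta : \mathrm{behavior}(\theta) = f\}}
             {\mathrm{vol}(\Theta)}
  \;\lesssim\; 2^{-K(f) + O(1)}.
\]
```

Read this way, behaviors with long shortest descriptions cannot occupy more than a tiny share of the initialization volume, so a selection process that only looks at behavior is overwhelmingly likely to land on something with a relatively short description.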
I think “explicit search” is similarly misleading, because most short programs (around a given behavior) are not coherent decision-theoretic optimizers. Search would only become properly explicit after the mesa-optimizer completes its own agent-foundations alignment research program and builds itself a decision-theory-based corrigible successor. A mesa-optimizer only needs to pursue the objective of aligned behavior (or whatever it’s being selected for), and whether that tends to be its actual objective or the instrumental objective of deceptive alignment is a toss-up; either would do for it to be selected. But in either case, it doesn’t need to be anywhere near ready to pursue a coherent goal of its own (since aligned behavior is also not goal-directed behavior in a strong sense).
Agreed on “explicit search” being a misleading phrase; I’ll replace it with just “search” when I’m referring to learned programs.
I don’t think I understand this. GPT-3 is a thing we found, which has 175B parameters, what is the short description of it?
I mean relatively short, as in the argument for why overparametrized models generalize. They still do get to ~memorize all the training data, but anything else comes at a premium, reducing the probability of selection for models whose behavior depends on those additional details. (This use of “short” to mean “could be 500 gigabytes” was rather sloppy/misleading of me, in a comment about sloppy/misleading use of words...)
Ah I think that’s the crux—I believe the overparametrized regime finds generalizing models because gradient descent finds functions that have low function norm, not low description length. I forget the paper that showed this for neural nets but here’s a proof for logistic regression.
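If the result in question is the implicit-bias-of-gradient-descent line of work (gradient descent on separable logistic regression converges in direction to the max-margin, i.e. minimum-norm-per-unit-margin, separator), here is a minimal numpy sketch of that phenomenon; the data, step size, and iteration counts are arbitrary choices of mine:

```python
# Toy illustration (my own sketch, not the proof referenced above) of the
# implicit bias of gradient descent on logistic regression: on linearly
# separable data the weight norm grows without bound, but the *direction*
# w / ||w|| converges to a large-margin separator.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable 2D data: two Gaussian blobs with labels +1 / -1.
n = 100
X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n, 2)),
               rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

def logistic_grad(w, X, y):
    # Gradient of the mean logistic loss log(1 + exp(-y * <w, x>));
    # margins are clipped only to avoid overflow warnings in np.exp.
    margins = np.clip(y * (X @ w), -60.0, 60.0)
    weights = y / (1.0 + np.exp(margins))
    return -(X * weights[:, None]).mean(axis=0)

w = np.zeros(2)
lr = 0.1
for step in range(1, 50001):
    w -= lr * logistic_grad(w, X, y)
    if step in (10, 100, 1000, 10000, 50000):
        direction = w / np.linalg.norm(w)
        geom_margin = np.min(y * (X @ direction))
        print(f"step {step:6d}  ||w|| = {np.linalg.norm(w):7.3f}  "
              f"direction = {np.round(direction, 3)}  margin = {geom_margin:.3f}")
# Expected pattern: ||w|| keeps (slowly) growing, the printed direction stops
# changing, and the geometric margin approaches its maximum over unit-norm
# separators -- i.e. the selected function is the low-norm one, with no
# reference to description length.
```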
I’m thinking of a setting where shortest descriptions of behavior determine sets of models that exhibit matching behavior (possibly in a coarse-grained way, so distances in behavior space are relevant). This description-model relation could be arbitrarily hard to compute, so it’s OK for shortest descriptions to be shortest programs or something ridiculous like that. This gives a partition of the model/parameter space according to the mapping from models to shortest descriptions of their behavior. I think shorter shortest descriptions (simpler behaviors) fill more volume in the parameter/model space, i.e. there are more models whose behavior is given by those descriptions (this is probably the crux; e.g. it’s false if behaviors are just models themselves and descriptions are exact).
Gradient descent doesn’t interact with descriptions or the description-model relation in any way, but since it selects models ~based on behavior, and starts its search from a random point in the model space, it tends to land in the larger cells of that partition, which correspond to simpler behaviors with shorter shortest descriptions.
This holds at every step of gradient descent, not just when it has already learned something relevant. The argument is that whatever behavior is selected, it is relatively simple, compared to other behaviors that could’ve been selected by the same selection process. Further training just increases the selection pressure.
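A crude Monte Carlo version of this volume claim, for a toy model class I made up for the purpose (a tiny tanh network on 4-bit inputs), treating the frequency with which random parameter draws induce a given truth table as a proxy for that behavior’s share of parameter volume:

```python
# Crude check of the "simpler behaviors fill more volume" intuition (my own
# toy setup, not from the discussion): sample random parameters, record the
# induced truth table on all 4-bit inputs, and see how uneven the resulting
# distribution over behaviors is.
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

# All 2^4 = 16 binary inputs for a 4-bit -> 1-bit "model".
inputs = np.array(list(itertools.product([0.0, 1.0], repeat=4)))

def behavior(params):
    """Truth table (as a bit string) of a 4-8-1 tanh network with given params."""
    W1, b1, W2, b2 = params
    h = np.tanh(inputs @ W1 + b1)
    out = h @ W2 + b2
    return "".join("1" if o > 0 else "0" for o in out)

counts = Counter()
for _ in range(20000):
    params = (rng.normal(size=(4, 8)), rng.normal(size=8),
              rng.normal(size=8), rng.normal())
    counts[behavior(params)] += 1

# There are 2^16 possible truth tables; if behaviors were hit uniformly, the
# most frequent table would show up only a handful of times in 20k samples.
for table, count in counts.most_common(5):
    print(table, count / 20000)
print("distinct behaviors seen:", len(counts))
```

In runs like this, the mass typically concentrates on a handful of very simple truth tables (constants, single-bit functions), which is the pattern the argument needs; whether and how far this carries over to realistic architectures and training is exactly the contested part.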
Yeah, I think you need some additional assumptions on the models and behaviors, which you’re gesturing at with the “matching behaviors” and “inexact descriptions”. Otherwise it’s easy to find counterexamples: imagine the model is just a single N x N matrix of parameters; then in general there is no shorter description of the behavior than the model itself.
Yes, there are non-invertible (you might say “simpler”) behaviors which each occupy more parameter volume than any given invertible behavior, but random matrices are almost surely invertible, so the actual optimization pressure towards low description length is infinitesimal.
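A quick numerical check of that last point (my own sketch; N and the number of trials are arbitrary):

```python
# For a model that is literally an N x N parameter matrix acting as a linear
# map, the "simpler" (rank-deficient) behaviors form a measure-zero subset of
# parameter space, so a random initialization essentially never lands on them.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 20, 10000

full_rank = sum(
    np.linalg.matrix_rank(rng.normal(size=(N, N))) == N
    for _ in range(trials)
)
print(f"{full_rank}/{trials} random {N}x{N} matrices were full rank")
# Expect 10000/10000: all the rank-deficient "simple" behaviors together
# occupy zero volume, so there is no volume-based pressure toward them here.
```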