The motivation is that we want a flexible and learnable posterior.
-Paul Christiano, 2020
Ahem, back on topic, I’m not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn’t look much like human-written text.
Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it’s not too easy for a discriminator to tell it apart from human text, or requiring it to play some functional role in a pure text generator, or whatever. There will still be some human-incomprehensible bits that can be transmitted through Z (because otherwise you’d need a discriminator so good that Z couldn’t be superhuman), but at least the amount is sharply limited.
But I’m really lost on how you could hope to limit the f side of this dichotomy. Penalize it for understanding the world too well given a random Z? Now it just has an incentive to notice random Zs and “play dead.” Somehow you want it not to do better by just becoming a catch-all model of the training data, even on the actual training data. This might be one of those philosophical problems, given that you’re expecting it to interpret natural language passages, and given the lack of a bright line between “understanding natural language” and “modeling the world.”
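(To be a bit more concrete about the kind of constraint I’m imagining: something like an adversarial penalty on Z, where a discriminator d is trained to tell Z apart from human-written text and Z pays a cost for being distinguishable. The form below is only an illustration of the idea, not a worked-out proposal, and the notation is mine.)

\[
\max_{Z}\ \sum_i \log f(y_i \mid x_i, Z)\;-\;\lambda \,\log \frac{d(Z)}{1 - d(Z)}
\]

where (x_i, y_i) are the training examples, d(Z) is the discriminator’s probability that Z is machine-generated rather than human-written, and λ controls how much non-textlike information Z can carry.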
> I’m not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn’t look much like human-written text.
f is just predicting P(y|x,Z); it’s not trying to model D (the underlying data distribution). So you don’t gain anything by putting facts about the data distribution in f—you have to put them in Z so that it changes P(y|x,Z).
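(My paraphrase of the setup, based on the companion post rather than Paul’s exact words: the joint optimization looks roughly like)

\[
Z^{*} = \arg\max_{Z}\ \Big[\ \log \mathrm{Prior}(Z)\ +\ \sum_i \log f(y_i \mid x_i, Z)\ \Big]
\]

where Prior(Z) is the (amplified) human’s prior over background texts Z, and f(y|x,Z) stands in for the human’s prediction of y given x after reading Z. Any fact about the data that helps with prediction has to arrive through Z, because Z is the only place the objective lets the data in.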
> Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it’s not too easy for a discriminator to tell it apart from human text, or requiring it to play some functional role in a pure text generator, or whatever.
The only thing Z does is get handed to the human for computing P(y|x,Z).
Ah, I think I see, thanks for explaining. So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions. Or maybe some amount of fine-tuning for “better” predictions by some method of eliciting its own standards, but not by actually comparing it to the ground truth.
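(The kind of decomposition I’m picturing is something like the sketch below; it’s purely schematic, and every name in it is made up for illustration rather than taken from the post.)

```python
from typing import Callable, List

def split_into_chunks(Z: str, chunk_size: int = 1000) -> List[str]:
    # Hypothetical helper: break a long background text Z into pieces small
    # enough that a copy of f could have been trained to use one directly.
    return [Z[i:i + chunk_size] for i in range(0, len(Z), chunk_size)]

def amplified_predict(
    x: str,
    Z: str,
    f: Callable[[str, str], str],              # f(question, small_Z) -> answer, trained to imitate the human
    combine: Callable[[str, List[str]], str],  # the human (or a model of the human) combining sub-answers
) -> str:
    # Schematic only: extend human predictions to a large Z by consulting
    # copies of f on chunks of Z, rather than fine-tuning f on the ground truth.
    chunks = split_into_chunks(Z)
    notes = [f(x, chunk) for chunk in chunks]
    return combine(x, notes)
```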
This (along with eventually reading your companion post) also helps resolve the confusion I was having over what exactly was the prior in “learning the prior”—Z is just like a latent space, and f is the decoder from Z to predictions. My impression is that your hope is that if Z and f start out human-like, then this is like specifying the “programming language” of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.
Is that somewhat in the right ballpark?
> So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions.
That’s right: f is either imitating a human, or it’s trained by iterated amplification / debate—in either case the loss function is defined by the human. In no case is f optimized to make good predictions about the underlying data.
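(A minimal sketch of the separation Paul is describing, assuming f and the human can both be treated as maps from (x, Z) to a distribution over answers; all names and types here are mine, for illustration only. The human defines f’s loss, and the real data only ever enters through the loss on Z.)

```python
import math
from typing import Callable, Dict, List, Tuple

# For the sketch, a "predictor" maps (x, Z) to a distribution over answers,
# represented as a dict from answer string to probability.
Predictor = Callable[[str, str], Dict[str, float]]

def f_loss(f: Predictor, human: Predictor, questions: List[str], Z: str) -> float:
    # f's training signal: cross-entropy against the *human's* answers given (x, Z).
    # Ground-truth labels from the dataset never appear in this loss.
    total = 0.0
    for x in questions:
        target = human(x, Z)
        pred = f(x, Z)
        total -= sum(p * math.log(pred.get(a, 1e-9)) for a, p in target.items())
    return total

def Z_loss(f: Predictor, log_prior: Callable[[str], float],
           data: List[Tuple[str, str]], Z: str) -> float:
    # The real data enters only here: Z is scored by how well the human-imitating f
    # predicts the dataset, plus the (amplified) human's prior over Z.
    return -log_prior(Z) - sum(math.log(f(x, Z).get(y, 1e-9)) for x, y in data)
```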
> My impression is that your hope is that if Z and f start out human-like, then this is like specifying the “programming language” of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.
Z should always be a human-readable (or amplified-human-readable) latent; it will necessarily remain human-readable because it has no purpose other than to help a human make predictions. f is going to remain human-like because it’s predicting what the human would say (or what the human-consulting-f would say etc.).
The amplified human is like the programming language of the universal prior, Z is like the program that is chosen (or slightly more precisely: Z is like a distribution over programs, described in a human-comprehensible way), and f is an efficient distillation of the intractable ideal.