Nice post. I agree that a crucial part of AGI alignment should involve routing an AI’s knowledge of human values to its own internal motivational circuitry, such that as its knowledge of human needs/goals/drives/preferences grows, so too does its alignment to those things. One key to this part of the problem may be to build in structural and inductive biases that steer the AI toward less inscrutable models.
I would say that to “know” something necessitates being able to make accurate predictions related to that thing. For most learning systems, this would imply developing some sort of generative or predictive model of its training data. In your dog/fish example, this might be realized with something like a conditional GAN, maybe combined with an autoencoder, where “knowing” the class of a sample allows the model to predict features of the sample (e.g., “fish” class → there will be fins about here and scales about here; “dog” class → there will be two eyes on the face, furry texture on the body, etc.). Combining the class label with some sort of latent-space representation should enable it to closely reproduce the full image.
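To make that concrete, here is a rough sketch of the kind of class-conditional generator I have in mind (PyTorch-style; the architecture, layer sizes, and names are illustrative placeholders, not a claim about what any particular model looks like):

```python
# Minimal sketch of a class-conditional generator (PyTorch).
# The idea: "knowing" the class plus a latent code is enough to reconstruct
# the main features of a sample. All shapes and layer sizes are arbitrary.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, n_classes=2, latent_dim=64, img_dim=28 * 28):
        super().__init__()
        # Embed the class label ("dog" vs. "fish") into a dense vector.
        self.label_embedding = nn.Embedding(n_classes, 16)
        # Map (latent code, class embedding) -> flattened image.
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 16, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, labels):
        # Conditioning: concatenate the latent code with the label embedding,
        # so the class determines where fins/fur/eyes "ought" to appear.
        cond = torch.cat([z, self.label_embedding(labels)], dim=1)
        return self.net(cond)

# Usage: generate a "fish" (class 1) from a random latent code.
gen = ConditionalGenerator()
z = torch.randn(1, 64)
fake_fish = gen(z, torch.tensor([1]))  # shape: (1, 784)
```

The point is just that the label alone pins down coarse, class-level features, and the latent code fills in the sample-specific details.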
The “knowledge” here is contained less in the class labels and latent space representations and more in the parameters and structure of the generative model, which is where it actually learned the generative/causal structure of its training data. This kind of knowledge allows such models to do things like inpainting, denoising, super-resolution, and animation of an image, generating information that was not in its inputs but that it predicts “ought” to be there based on what it has learned before.
This idea is also related to the predictive coding theory of the brain, where perception happens by constantly trying to generate predictions of what the senses will receive and continuously updating based on prediction errors. Again, “knowledge” exists in the generative models and causal graphs that the brain uses to make these predictions.
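As a toy illustration of that prediction-error loop (a cartoon only, nothing biologically serious; the weight matrix, latent size, and learning rate are arbitrary stand-ins):

```python
# Toy predictive-coding loop (NumPy): the system holds a belief mu about the
# latent cause of its input, generates a top-down prediction through a fixed
# generative matrix W, and nudges mu to reduce the prediction error.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))   # generative model: latent cause -> predicted senses
x = rng.normal(size=10)        # actual sensory input
mu = np.zeros(3)               # current belief about the latent cause
lr = 0.02

for _ in range(200):
    prediction = W @ mu        # top-down prediction of the senses
    error = x - prediction     # bottom-up prediction error
    mu += lr * (W.T @ error)   # update the belief to explain away the error

print("remaining prediction error:", np.linalg.norm(x - W @ mu))
```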
Thanks! I get your arguments about “knowledge” being restricted to predictive domains, but I think it’s (mostly) just a semantic issue. I also don’t think the specifics of the word “knowledge” are particularly important to my points, which is what I attempted to clarify at the start. But I’ve clearly been typical-minding, assuming that of course everyone would agree with me about a dog/fish classifier having “knowledge”, when it’s more of an edge case than I thought! Perhaps a better version of this post would have either tabooed “knowledge” altogether or picked a more obviously knowledge-having model.
Well, it certainly has mutual information with the training data, even if it only acts as a classifier (actually, classifiers can be seen as inverse generative models, so there is some generative-ish information there as well). From that perspective, your arguments certainly hold, although I’m not sure “mutual information” is precisely what you’re going for either. Yes, I agree; I should have tabooed “knowledge”, at least in how I was reading it.
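To spell out the “inverse generative model” aside: given a prior over inputs, Bayes’ rule turns the classifier’s p(class | image) into a (very crude) p(image | class). Here is a toy sketch with a tiny discrete input space and made-up numbers:

```python
# Toy "classifier as inverse generative model": with a prior over inputs,
# Bayes' rule turns discriminative p(class | image) into a crude p(image | class).
# The discrete "image" space and all probabilities are made up for illustration.
import numpy as np

images = ["img_a", "img_b", "img_c"]           # toy discrete input space
p_image = np.array([0.5, 0.3, 0.2])            # prior over inputs
p_dog_given_image = np.array([0.9, 0.4, 0.1])  # classifier outputs for class "dog"

# Bayes: p(image | dog) is proportional to p(dog | image) * p(image)
joint = p_dog_given_image * p_image
p_image_given_dog = joint / joint.sum()

for img, p in zip(images, p_image_given_dog):
    print(f"p({img} | dog) = {p:.2f}")
```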