Most of the time, when I train a machine learning model on some data, that data isn’t data about the ML training algorithm or model itself.
If the data isn’t at all about the ML training algorithm, then why would it even build a model of itself in the first place, regardless of whether it was dualist or not?
A machine learning model doesn’t get understanding of or data about its code “for free”, in the same way we don’t get knowledge of how brains work “for free” despite the fact that we are brains.
We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don’t have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an “ego”).
Part of what I’m trying to indicate with the “dualist” term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.
Also, if you think that, then I’m confused why you think this is a good safety property; human neuroscientists are precisely the sort of highly agentic misaligned mesa-optimizers that you presumably want to avoid when you just want to build a good prediction machine.
--
I think I didn’t fully convey my picture here, so let me try to explain how I think this could happen. Suppose you’re training a predictor, and the data includes enough information about the predictor that it has to form some model of itself. Once that’s happened—or while it’s in the process of happening—there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself. A much simpler model would be one that just uses the same machinery for both, and since ML is biased towards simple models, you should expect it to be shared—which is precisely the thing you were calling an “ego.”
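In compression terms, this de-duplication pressure is the same one an LZ-style compressor exploits: a second copy of the same information costs almost nothing, because it can be encoded as a back-reference to the first copy. A minimal sketch of that effect (random bytes standing in for the information about the decision-making process):

```python
import os
import zlib

# Stand-in for the information describing the decision-making process:
# 1000 incompressible (random) bytes.
machinery_info = os.urandom(1000)

# "Dualist" storage: the same information stored twice -- once as the
# prediction machinery, once again inside the world model.
duplicated = machinery_info + machinery_info

# A description-length-minimizing code stores the second copy as a
# back-reference to the first, so the duplicate is nearly free.
single_len = len(zlib.compress(machinery_info, 9))
dup_len = len(zlib.compress(duplicated, 9))
print(single_len, dup_len)  # dup_len is only slightly larger than single_len
```

The analogy, under the assumption that ML’s simplicity bias acts like a description-length penalty, is that the shared-machinery model plays the role of the back-reference.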
When you wrote

having an “ego” which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.
that suggested to me that there were 2 instances of this info about Predict-O-Matic’s decision-making process in the dataset whose description length we’re trying to minimize. “De-duplication” only makes sense if there’s more than one. Why is there more than one?
We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don’t have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an “ego”).
Sometimes people take psychedelic drugs/meditate and report an out-of-body experience, oneness with the universe, ego dissolution, etc. This suggests to me that ego is an evolved adaptation rather than a necessity for cognition. A clue is the fact that our ego extends to all parts of our body, even those which aren’t necessary for computation (but are necessary for survival & reproduction).
there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself.
The prediction machinery is in code, but this code isn’t part of the info whose description length we’re attempting to minimize, unless we take special action to include it in that info. That’s the point I was trying to make previously.
Compression has important similarities to prediction. In compression terms, your argument is essentially that if we use zip to compress its own source code, it will be able to compress its own source code using a very small number of bytes, because it “already knows about itself”.
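That claim is easy to check in the toy compression setting. Here zlib compresses the Python source of the gzip module as a stand-in for “zip’s own source code” (zlib itself is compiled C, so its source isn’t available to read here):

```python
import gzip
import inspect
import zlib

# Stand-in for "zip's own source code": the Python source of the gzip
# module, compressed by zlib.
source = inspect.getsource(gzip).encode()
compressed = zlib.compress(source, 9)

# The compressor gets no discount for "knowing about itself": its own
# source compresses about as well as any comparable code-plus-comments text,
# not down to "a very small number of bytes".
ratio = len(compressed) / len(source)
print(f"{len(source)} -> {len(compressed)} bytes (ratio {ratio:.2f})")
```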
that suggested to me that there were 2 instances of this info about Predict-O-Matic’s decision-making process in the dataset whose description length we’re trying to minimize. “De-duplication” only makes sense if there’s more than one. Why is there more than one?
ML doesn’t minimize the description length of the dataset—I’m not even sure what that might mean—rather, it minimizes the description length of the model. And the model does contain two copies of information about Predict-O-Matic’s decision-making process—one in its prediction process and one in its world model.
The prediction machinery is in code, but this code isn’t part of the info whose description length we’re attempting to minimize, unless we take special action to include it in that info. That’s the point I was trying to make previously.
Modern predictive models don’t have some separate hard-coded piece that does prediction—instead you just train everything. If you consider GPT-2, for example, it’s just a bunch of transformer layers hooked together. The only information that isn’t included in the description length of the model is what a transformer is, but “what’s a transformer” is quite different from “how do I make predictions.” All of the information about how the model actually makes its predictions in that sort of a setup is going to be trained.
I think maybe what you’re getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn’t imply it’s aware of “itself” as an entity. And in some cases the relevant aspect of its internals might not be available as a conceptual building block. For example, a model trained using stochastic gradient descent is not necessarily better at understanding or predicting a process which is very similar to stochastic gradient descent.
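That first point—that a model reproduces labels it generated itself with no effort—is trivially true in a toy setting, and the triviality is the point: no self-representation is involved. A minimal numpy sketch, with a fixed linear model standing in for the predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)  # a fixed linear "predictor"

def model(X):
    return X @ w

X = rng.normal(size=(100, 5))
self_labels = model(X)  # labels the model itself produced

# Predicting its own predictions is trivially perfect: same function,
# same inputs. Nothing here required modeling "itself" as an entity.
mse = float(np.mean((model(X) - self_labels) ** 2))
print(mse)  # 0.0
```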
Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it’s not the task that it was trained for. It doesn’t magically have the “self-awareness” necessary to see what’s going on.
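A rough sketch of that masking experiment, with a small linear regressor standing in for the trained model (the padding scheme and the masked index are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Train a small linear model on an ordinary regression task.
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=200)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Now pose the self-prediction task: mask out one weight (the label y)
# and feed the remaining weights in as if they were an input x.
masked_idx = 3
x_from_weights = np.append(np.delete(w_hat, masked_idx), 0.0)  # pad to input size
prediction = float(x_from_weights @ w_hat)  # the model's output on this "input"
target = float(w_hat[masked_idx])

# The model was never trained for this task, so its output has no
# systematic relationship to the masked weight.
error = abs(prediction - target)
print(f"self-prediction error: {error:.3f}")
```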
In order to be crisp about what could happen, your explanation also has to account for what clearly won’t happen.
I think maybe what you’re getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn’t imply it’s aware of “itself” as an entity.
No, but it does imply that it has the information about its own prediction process encoded in its weights such that there’s no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.
Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it’s not the task that it was trained for. It doesn’t magically have the “self-awareness” necessary to see what’s going on.
Sure, but that’s not actually the relevant task here. It may not understand its own weights, but it does understand its own predictive process, and thus its own output, such that there’s no reason it would encode that information again in its world model.
No, but it does imply that it has the information about its own prediction process encoded in its weights such that there’s no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.
OK, it sounds like we agree then? Like, the Predict-O-Matic might have an unusually easy time modeling itself in certain ways, but other than that, it doesn’t get special treatment because it has no special awareness of itself as an entity?
Edit: Trying to provide an intuition pump for what I mean here—in order to avoid duplicating information, I might assume that something which looks like a stapler behaves the same way as other things I’ve seen which look like staplers—but that doesn’t mean I think all staplers are the same object. It might in some cases be sensible to notice that I keep seeing a stapler lying around and hypothesize that there’s just one stapler which keeps getting moved around the office. But that requires that I perceive the stapler as an entity every time I see it, so that entities which were previously separate in my head can be merged. Whereas, arguendo, my prediction machinery isn’t necessarily an entity that I recognize; it’s more like the water I’m swimming in, in some sense.
BTW this thread also seems relevant: https://www.lesswrong.com/posts/RmPKdMqSr2xRwrqyE/the-dualist-predict-o-matic-usd100-prize#AvbnFiKpJxDqM8GYh

I don’t think we do agree, in that I think pressure towards simple models implies that they won’t be dualist in the way that you’re claiming.