Thanks. That did actually occur to me, but I left it out because I wasn’t sure and didn’t want to go on an exhausting chase down every possible interpretation of the paper.
Anyway, if the input to the Prosociality Score Model is a set of latent variables rather than a set of pixels, then:
My claim from the OP still stands: in the absence of some creative solution not in the paper, there are two adversarial out-of-distribution (OOD) generalization problems.
One of those two problems (OOD generalization of the Prosociality Score Model) might get less bad, although I don’t see why it would go away altogether.
…But only if the labels are correct, and the labeling problem is potentially much harder now, because the latent variables include inscrutable information about “how the AI is thinking about / conceptualizing the things that it’s seeing / doing”. I think. And if they do, then how are the humans supposed to label them as good or bad? Like, if the AI notices someone feeling physically good but psychologically distressed, we want to label it as low-energy when the AI is thinking about the former aspect and high-energy if the AI is thinking about the latter aspect, I imagine. And then we start getting into nasty neural net interpretability challenges.
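To make the labeling difficulty concrete, here is a purely illustrative toy sketch (the names, structure, and numbers are mine, not from the paper): the target energy a human would have to supply is a function of the latent state, and the part of the latent that determines the right label, namely which aspect the AI is attending to, is exactly the part humans can’t read off without interpretability tools.

```python
# Toy illustration (not from the paper): the label a human must supply depends
# on the AI's latent state, not just on the raw scene. Two latents encoding the
# *same* scene, attended to differently, call for different energy labels.
from dataclasses import dataclass

@dataclass
class LatentState:
    scene_id: str          # which scene the AI is perceiving
    attended_aspect: str   # inscrutable in practice; made explicit for this toy

def human_energy_label(latent: LatentState) -> float:
    """Target energy (low = prosocial/good) a human labeler would have to assign."""
    if latent.attended_aspect == "physical_comfort":
        return 0.1   # the person feels physically good -> low energy
    if latent.attended_aspect == "psychological_distress":
        return 0.9   # the person is distressed -> high energy
    raise ValueError("The labeler cannot tell which aspect the latent encodes.")

# Same scene, two different "ways of thinking about it":
a = LatentState("person_on_sofa", "physical_comfort")
b = LatentState("person_on_sofa", "psychological_distress")
print(human_energy_label(a), human_energy_label(b))  # 0.1 0.9
```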
Also, aren’t the latent variables changing as we go, thanks to self-supervised learning? But the Intrinsic Cost Module is supposed to be immutable. I’m confused about how this is supposed to work.
> …But only if the labels are correct, and the labeling problem is potentially much harder now, because the latent variables include inscrutable information about “how the AI is thinking about / conceptualizing the things that it’s seeing / doing”. I think. And if they do, then how are the humans supposed to label them as good or bad? Like, if the AI notices someone feeling physically good but psychologically distressed, we want to label it as low-energy when the AI is thinking about the former aspect and high-energy if the AI is thinking about the latter aspect, I imagine. And then we start getting into nasty neural net interpretability challenges.
I believe the idea was (though this may be my own interpretation rather than the paper’s; I cannot find it in the paper now) that separate NNs would be trained to interpret world model representations (states) as human-understandable text. At the very least, this could make the sheer size of the interpretability task much smaller, because interpreting the internal activations and circuits of the world model or actor NNs that produce these representations might not matter. This is related to Bengio’s idea of separating knowledge/models from the “inference” network, where the latter could be much larger. That idea is, of course, embodied in GFlowNets, and indeed it seems to me that they could be used in place of the “world model” in APTAMI (although LeCun doesn’t want H-JEPA to be probabilistic).
I don’t know how to bootstrap the training of these interpretability modules for representations alongside the critic, but this doesn’t seem like an insurmountable engineering problem.
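For what it’s worth, here is a minimal sketch of what such an interpreter module could look like, entirely my own toy with made-up sizes (APTAMI doesn’t specify this): a small decoder head trained on pairs of frozen world-model latents and human-written descriptions, so the interpretability effort targets the representations rather than the world model’s internals.

```python
# Minimal sketch (my own toy, not APTAMI's spec): a separate "interpreter" head
# mapping frozen world-model latents to short text descriptions, so humans can
# inspect/label states without reverse-engineering the world model itself.
import torch
import torch.nn as nn

LATENT_DIM, VOCAB_SIZE, MAX_LEN = 256, 1000, 16  # hypothetical sizes

class LatentToTextInterpreter(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(LATENT_DIM, 512)
        self.decoder = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
        self.vocab_head = nn.Linear(512, VOCAB_SIZE)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, LATENT_DIM) -> token logits: (batch, MAX_LEN, VOCAB_SIZE)
        h = torch.tanh(self.proj(latent))             # (batch, 512)
        steps = h.unsqueeze(1).repeat(1, MAX_LEN, 1)  # feed the latent at every step
        out, _ = self.decoder(steps)
        return self.vocab_head(out)

# Training-step sketch on (latent, human description) pairs; the world model
# itself is frozen and only supplies the latents.
interpreter = LatentToTextInterpreter()
optimizer = torch.optim.Adam(interpreter.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

latents = torch.randn(8, LATENT_DIM)                  # stand-in for real latents
targets = torch.randint(0, VOCAB_SIZE, (8, MAX_LEN))  # stand-in for tokenized labels
logits = interpreter(latents)
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```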
> Also, aren’t the latent variables changing as we go, thanks to self-supervised learning? But the Intrinsic Cost Module is supposed to be immutable. I’m confused about how this is supposed to work.
To me, the “mutability” / online trainability of various parts of the architecture, or of the AI as a whole, seems more like a spectrum than a binary distinction, and in fact not a very important one. You can always think of “immutable” modules as just learning very slowly, on the timescale of AI engineering/training iterations, a.k.a. “model selection”.
So the “immutability” of the Intrinsic Cost just highlights that it should be inferred “very slowly, with governance and caution”. But you can also configure the learning rate of other parts of the architecture, such as the world model, to be very low, so that all the surrounding infrastructure (trainable critic modules, the “altruism controller” that I proposed, interpretability modules, alignment processes and protocols, and, finally, human understanding, i.e., “theories of AI mind” in human heads) can track the changes in the world model and its representations closely enough that they don’t fall out of alignment with them.
In fact, the requirement that NNs in human brains en masse track at least some changes in the AI’s world models in time, and not fall out of alignment with them, practically means that the rate of learning in the world models at the high levels of the JEPA hierarchy should be so slow that they are de facto “immutable” within a particular AI version/iteration and only learned/updated between versions, just like the Intrinsic Cost module. Moreover, this rate might need to be much slower than even the typical technology release cadence (such as “a new version of the AI every 6 months”), e.g., not changing much over any 5-year interval, which seems to me the minimum realistic time in which humans en masse could adjust to a new kind of alien mind around them.
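To illustrate the learning-rate spectrum point above (module names and numbers are placeholders of mine, not APTAMI’s): with per-module parameter groups in a standard optimizer, “immutable” is just the limiting case of a very low (or zero) learning rate within a given AI version.

```python
# Sketch of "immutability as a learning-rate setting" using per-module parameter
# groups. The modules here are trivial stand-ins, not the real architecture.
import torch.nn as nn
from torch.optim import SGD

critic         = nn.Linear(64, 1)    # tracks the world model online
world_model    = nn.Linear(64, 64)   # allowed to drift only very slowly
intrinsic_cost = nn.Linear(64, 1)    # "immutable" within an AI version

optimizer = SGD([
    {"params": critic.parameters(),         "lr": 1e-3},
    {"params": world_model.parameters(),    "lr": 1e-6},
    {"params": intrinsic_cost.parameters(), "lr": 0.0},  # updated only between versions,
])                                                       # via governance, not gradients
```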
Update: I wrote a big article, “Aligning an H-JEPA agent via training on the outputs of an LLM-based ‘exemplary actor’”, in which I develop the thinking behind the comment above (but also update it significantly).