But even if it is, this thing is far less naturally useful for predicting the human's future behaviour than the algorithm the human actually implements!
I see why this might be true for an LLM trained with a purely predictive loss, but I have a hard time believing the same will hold for an LLM that is grounded. I imagine that LLMs will eventually be trained to perform some sort of in-context adaptation to a new environment while receiving a reward signal from a human in the loop. A model that learns to maximize the approval of a human prompting it to cut down a tree will inevitably have to learn a correspondence between the human's ontology and the real world. I imagine that ELK, a.k.a. "dereferencing the values pointer in your head", will simply be a capability the model acquires, and retargeting the search then just means identifying this mechanism and verifying that it is robust.
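To make the setup I have in mind concrete, here is a minimal toy sketch of in-context adaptation against a human-in-the-loop reward signal. Everything in it (the `human_approval` function, the `ToyPolicy` class, the action names) is a hypothetical stand-in for illustration, not a real training API:

```python
import random

ACTIONS = ["cut_tree", "do_nothing"]

def human_approval(tree_looks_cut: bool) -> float:
    """Stand-in for the human-in-the-loop reward: the human approves
    whenever the world *looks* like the tree was cut down."""
    return 1.0 if tree_looks_cut else 0.0

class ToyPolicy:
    """Tabular policy: tracks a running mean reward per action."""
    def __init__(self):
        self.values = {a: 0.0 for a in ACTIONS}
        self.counts = {a: 0 for a in ACTIONS}

    def act(self, epsilon: float = 0.1) -> str:
        # Epsilon-greedy: mostly exploit, occasionally explore.
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(self.values, key=self.values.get)

    def update(self, action: str, reward: float) -> None:
        # Incremental running-mean update.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

def episode(policy: ToyPolicy) -> None:
    action = policy.act()
    tree_looks_cut = (action == "cut_tree")  # in this toy world, looks match reality
    policy.update(action, human_approval(tree_looks_cut))

policy = ToyPolicy()
for _ in range(1000):
    episode(policy)
print(policy.values)  # converges toward whatever the human approves of
```

The point of the sketch is just that the reward is grounded in the human's reaction to the actual environment rather than in next-token prediction.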
Does this differ from your model of superhuman AI development?
I don't see how this changes the picture? If you train a model on real-time feedback from a human, that human's algorithm is still the same one that is foolable by, e.g., cutting down the tree and replacing it with papier-mâché. None of this forces the model to learn a correspondence between the human ontology and the model's internal best-guess model of the world, because the reason any of this is a problem in the first place is that the human's algorithm points at something which is not the thing we actually care about.
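To illustrate, continuing the hypothetical toy sketch above: the human's approval is computed from what the human observes, so an action that actually cuts the tree and an action that swaps in a fake earn identical reward, and nothing in the loss distinguishes them:

```python
def human_approval(tree_looks_cut: bool) -> float:
    # Same stand-in reward as before: a function of the human's observation.
    return 1.0 if tree_looks_cut else 0.0

def world_state(action: str) -> tuple[bool, bool]:
    """Returns (tree_actually_cut, tree_looks_cut) for each action."""
    if action == "cut_tree":
        return True, True
    if action == "fake_cut":  # replace the tree with papier-mache
        return False, True
    return False, False

for action in ["cut_tree", "fake_cut", "do_nothing"]:
    actually_cut, looks_cut = world_state(action)
    reward = human_approval(looks_cut)  # the reward only sees appearances
    print(f"{action:10s} actually_cut={actually_cut} reward={reward}")
# cut_tree   actually_cut=True  reward=1.0
# fake_cut   actually_cut=False reward=1.0  <- identical gradient pressure
```

Since `human_approval` depends only on `tree_looks_cut`, maximizing it never forces the model to track the actual state of the tree, only the human's perception of it.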