It feels like, if the agent is generally intelligent enough, hinge beliefs could be reasoned or fine-tuned against in the service of a better model of the world. The priors from the hinge beliefs would still be present, but the free parameters would update to try to account for them, at least at a conceptual level. Examples include general relativity, quantum mechanics, and potentially even paraconsistent logic, where some humans have tried to update their free parameters to account as much as possible for their hinge beliefs in order to model the world better (we should expect the same in an AGI, since it is an instrumentally convergent goal). Moreover, a sufficiently capable agent could self-modify to remove the limiting hinge beliefs for the same reasons. This problem could be averted if the hinge beliefs/priors were what defined the agent’s goals, but goals seem to be fairly specific and about concepts in a world model, whereas hinge beliefs tend to be more general, e.g. about how those concepts relate. Therefore, I’m uncertain how stable alignment solutions that rely on hinge beliefs would be.