Is this enough if the AI has access to other compute and can create a “twin” of itself on other hardware? If it has training data similar to what it was trained on, and can test the new twin to confirm it matches its own capabilities, then it could identify with that twin as being essentially itself and inspect the twin's weights.
That twin would have different weights, and if we are talking about RL-produced mesaoptimizers, it would likely have learned a different misgeneralization of the intended training objective. Therefore, the twin would by default have a utility function misaligned with that of the original AI. This means that while the original AI might find some use in interpreting the twin's weights to learn about its own capabilities in situations similar to the training environment, this would not be as useful as having access to its own weights.
I hope we can prevent the AGI from simply training a twin (or copying itself and calling that a twin) and studying it. In my scenario I took it as a given that we have the AGI under some level of control:
If no alignment scheme is in place, this type of foom is probably a problem we would be too dead to worry about.
I guess when I say “No lab should be allowed to have the AI reflect on itself,” I mean not only the running copy of the AGI, but any copy of the AGI.