I don’t agree with this line of reasoning, because actual ML systems don’t implement either a pure simplicity prior or a pure speed prior. If we assume away inner alignment failures, then current ML systems implement something more like a speed-capped simplicity prior.
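To be a bit more concrete about what I mean by “speed-capped simplicity prior” (this is just an illustrative formalization of my own, with K(h) standing in for description length, T(h) for per-step compute, and B for the architecture’s fixed forward-pass budget):

```latex
% Illustrative only: simplicity weighting, hard-capped by a compute budget B.
\[
P(h) \;\propto\;
\begin{cases}
  2^{-K(h)} & \text{if } T(h) \le B \\
  0         & \text{otherwise}
\end{cases}
\]
```

How well this matches real training dynamics is debatable; the point is just that neither a pure speed prior nor a pure simplicity prior is the right description on its own.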
If we allow inner alignment failures, then things become FAR more complex. A self-perpetuating mesa optimizer will try to influence the distribution of future hypotheses so that the system retains the mesa optimizer in question. The system’s “prior” thus becomes path dependent in a way that neither the speed prior nor the simplicity prior captures at all.
A non-inner-aligned system is more likely to retain hypotheses that arise earlier in training, since the mesa optimizers implementing those hypotheses can guide the system’s learning process away from directions that would remove them. This seems like the sort of thing that could cause an unaligned AI to nonetheless end up with a plurality of nonhuman values that it perpetuates into the future.
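As a toy illustration of that path dependence (entirely made up: the hypothesis names, complexities, and the `resist` parameter are placeholders, not a model of real training dynamics), here is a sketch in which an early-arriving, self-perpetuating hypothesis blocks most attempts to replace it with simpler ones, so what gets retained depends on arrival order rather than on simplicity alone:

```python
# Toy sketch (my own, illustrative only): a "training" loop that considers
# hypotheses in arrival order. Under a pure simplicity-style rule the final
# retained hypothesis shouldn't depend on order; a self-perpetuating
# hypothesis that resists replacement makes the outcome path dependent.

import random

def train(hypotheses, resist=0.9, seed=0):
    """Sequentially consider hypotheses; return the name of the one retained.

    hypotheses: list of (name, complexity, self_perpetuating) in arrival order.
    resist: probability that a retained self-perpetuating hypothesis blocks a
            replacement attempt (0.0 recovers a pure simplicity-style rule).
    """
    rng = random.Random(seed)
    current = None
    for cand in hypotheses:
        if current is None:
            current = cand
            continue
        _, complexity, self_perpetuating = current
        # Simplicity pressure: a simpler candidate would normally replace us.
        if cand[1] < complexity:
            # Path dependence: an entrenched, mesa-optimizer-like hypothesis
            # steers learning away from updates that would remove it.
            if self_perpetuating and rng.random() < resist:
                continue
            current = cand
    return current[0]

pool = [
    ("early_mesa_optimizer", 12, True),   # arrives first, complex, entrenches
    ("simple_aligned_hypothesis", 5, False),
    ("simpler_hypothesis", 3, False),
]

runs = 1000
with_resist = sum(train(pool, resist=0.9, seed=s) == "early_mesa_optimizer"
                  for s in range(runs))
no_resist = sum(train(pool, resist=0.0, seed=s) == "early_mesa_optimizer"
                for s in range(runs))
print(f"retained early mesa optimizer: {with_resist}/{runs} with resistance, "
      f"{no_resist}/{runs} without")
```

With `resist=0.0` the simplest hypothesis always wins, as a simplicity prior would predict; with `resist=0.9` the early mesa-optimizer-like hypothesis is retained in roughly 80% of runs purely because it arrived first.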