Oracles must somehow be incentivized to give useful answers
A microscope model must also be trained somehow, for example with unsupervised learning. Therefore, I expect such a model to also look like it’s “incentivized to give useful answers” (e.g. an answer to the question: “what is the next word in the text?”).
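To make the "incentive" concrete: a minimal sketch of what an unsupervised next-word training signal looks like, assuming a generic autoregressive setup in PyTorch. The architecture, sizes, and data here are placeholders, not anything specific to microscope models.

```python
# Minimal sketch of the training signal a microscope model would still get
# from unsupervised learning: a next-token prediction loss. Model, sizes,
# and data are illustrative placeholders.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                     # logits: (batch, seq, vocab)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 32))  # stand-in for a text corpus
logits = model(tokens[:, :-1])                  # predict token t+1 from tokens up to t
loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

Whether the trained model is later used as an oracle or as a microscope, this objective is the same, which is the point: the "incentive to give useful answers" comes from training, not from the downstream use.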
My understanding is that what distinguishes a microscope model is how it is used after it's already trained (namely, researchers look at its internals to gain insights and so on, rather than running inference to use its valuable outputs). If this is correct, it seems that we should only use safe training procedures for the purpose of training useful microscopes, rather than training arbitrarily capable models.
Our usual objective is “Make it safe, and if we aligned it correctly, make it useful.” A microscope is useful even if it’s not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe, so any attack it devises cannot be specialized against humanity. Imagine finding the shortest-to-specify abstract game that requires AGI to solve (Nomic?), then instantiating an AGI to solve it, just to learn about AI design from the inner optimizers it produces.
It could deduce that someone is trying to learn about AI design from its inner optimizers, and maybe it could deduce our laws of physics, because they are the simplest laws under which someone would run such an experiment, but the randomness of quantum experiments means it cannot deduce which Everett branch it is in.
Ideally, the tldrbot we set up to interpret the results would use a random perspective onto the microscope, so that any attack also cannot be specialized to that perspective.
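One toy way to cash out "a random perspective" (this is my own reading, not something specified above): the interpreter never sees raw activations, only a freshly sampled random rotation of them, so nothing in the model's internals can be tuned to a fixed readout basis. The helper names below are hypothetical.

```python
# Toy sketch (one possible reading of "a random perspective onto the
# microscope"): the interpreting tool only ever sees activations after a
# random orthogonal rotation, sampled fresh for each analysis run.
import numpy as np

rng = np.random.default_rng()

def random_orthogonal(d):
    """Sample a Haar-uniform random orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))              # sign fix for uniformity

def random_perspective(activations):
    """Rotate (n_samples, d) recorded activations into an unpredictable basis."""
    d = activations.shape[1]
    return activations @ random_orthogonal(d)

acts = rng.normal(size=(100, 16))               # stand-in for recorded activations
view = random_perspective(acts)                 # what the interpreter would analyze
```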