The “honeypot” sections of the post also seem to rely on the premise that the base generative model (not Alice) will try to run world simulations inside itself while answering these types of prompts. I’m not sure this will happen: given the unreliability of simulations, the model would probably need to run several (perhaps many) simulations in parallel and then analyse/summarise/draw conclusions from their results to emit its actual answer. Even running a single such simulation might be astronomically expensive, well over the model’s budget for answering the prompt (even if we give it a huge budget).
I think of superhuman generative models as superhuman futurists and scientists who can account for hundreds of distinct world models (economic, historical, anthropological, scientific, etc., including world models learned by the generative model itself rather than originating from humans), compare their predictions, reason about the discrepancies, and come up with the most likely prediction or explanation of something. But not as world simulators.
Assuming we have such a model, I’m not sure why we can’t just ask it a direct question: “What course of action will lead to solving the human/AI agent alignment problem with the highest probability?” Why do we need the simulated Alice model as a proxy?
The model just predicts the next observation/token. So there’s no guarantee that you can ask it a question and get a true answer.