I’m confused about the “Patient Research” scenario: is it supposed to be realistic, or a thought experiment? Are you talking about an abstract generative model with unlimited computational resources or an actual model that may exist X years from now? If you are talking about a realistic model, why do you think the model will simulate anything at all? Solving complex problems with AI doesn’t necessarily mean simulations; if anything, simulation is a “dumb”, brute-force approach: e.g., AlphaFold doesn’t simulate protein folding, it predicts the folded structure directly.
But if you do expect that the model will run simulations inside itself, what level of simulation fidelity (macroeconomic vs. humans vs. cells vs. molecules) do you expect (or assume)? And which period is simulated: from the dawn of history (which might eventually produce the described world, though I don’t see how the simulation could be initialised so that it actually results in that world; this seems impossible to me), from the year 2000 to the year 4000 (same problem), or from the “Eureka” moment in the head of one of the researchers closer to the year 4000 that led to writing the book? The latter case (or its shorter versions, e.g., just simulating the process of writing the book) doesn’t really look like a “simulation” to me, but rather like a normal generative/predictive process.
I have a feeling that imagining superhuman generative models as oracles that can simulate entire worlds is sci-fi, even by LessWrong standards. There is no way around the halting problem and the second law of thermodynamics. But maybe I misunderstand something? In my view, superhuman generative models can run simulations inside themselves, much as researchers in economics or evolutionary biology run contained simulations while testing their hypotheses, while recognising the limits of such simulations (the irreducible complexity of the real world, which will never unfold exactly as any simulation predicts) and accounting for these limits in their conclusions (e.g., reflected in their credences). Still, superhuman generative models won’t be able to simulate the whole planet with meaningful accuracy over any significant timespan, and they will be well aware of this fact themselves.
Are you talking about an abstract generative model with unlimited computational resources or an actual model that may exist X years from now?
I think it’s plausible we’ll have models capable of pretty sophisticated simulation, with access to enormous computational resources, before AGI becomes a serious threat.
Solving complex problems with AI doesn’t necessarily mean simulations
Totally agree. I’m using the word simulation a bit loosely. All I mean is that the AI can predict observations as if from a simulation.
But if you do expect that the model will run simulations inside itself, what level of simulation fidelity
I’m imagining something like “the model has high fidelity on human thought patterns and some basic physical-world-model facts, like human bodies being fragile, but it doesn’t simulate individual cells or anything like that”.
There is no way around the halting problem and the second law of thermodynamics.
I don’t think the halting problem is relevant here? Nor the second law… but maybe I’m missing something?
I should have been clearer; I don’t exactly remember what I was thinking about now. Maybe about the suggested prompt in the “Simulating Human Alignment Researchers” section: if the oracle is supposed to somehow cut through everything that would have happened over 2000 (or more?) years in its simulation, then either the oracle itself would need to run for thousands or millions of years (for a sufficiently high-fidelity simulation), or the simulation will dramatically diverge from what is actually likely to happen in reality. (However, there is no particular relation between this idea and the halting problem or the second law.)
Alternatively, if the prompt is designed just to “prime the oracle’s imagination” before writing the textbook, rather than as an invitation to run an elaborate simulation, I don’t see how it’s at all safer than plainly asking the oracle to write the alignment textbook with proofs.
The “honeypot” sections of the post also seem to rely on the premise that the base generative model (not Alice) will try to run some world simulations inside itself while answering these types of prompts. I’m not sure this will happen: given the unreliability of any single simulation, the model would probably need to run several (many?) simulations in parallel and then analyse/summarise/draw conclusions from their results to emit its actual answer. Even running a single such simulation might be astronomically expensive, well over the model’s budget for answering the prompt (even if we give it a huge budget).
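To make concrete the kind of “many rollouts, then a summary step” process I have in mind, and why it blows up the compute budget, here is a toy sketch. The random-walk “world” and the averaging step are invented placeholders, not a claim about how a real model works:

```python
import random
from statistics import mean

# Toy stand-in for "run several stochastic rollouts, then summarise them
# into one answer". Everything here is a hypothetical placeholder.

def rollout(horizon: int, seed: int) -> list[float]:
    """One stochastic trajectory of the toy world; in a real world model,
    each step would itself be extremely expensive."""
    rng = random.Random(seed)
    state, trajectory = 0.0, []
    for _ in range(horizon):
        state += rng.gauss(0, 1)
        trajectory.append(state)
    return trajectory

def answer_by_simulation(horizon: int, n_rollouts: int) -> float:
    """Total cost ~ n_rollouts * horizon * cost_per_step -- the budget
    problem I am pointing at."""
    finals = [rollout(horizon, seed)[-1] for seed in range(n_rollouts)]
    return mean(finals)  # the "analyse/summarise" step

print(answer_by_simulation(horizon=1000, n_rollouts=32))
```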
I think of superhuman generative models as superhuman futurists and scientists who can account for hundreds of distinct world models (economic, historical, anthropological, scientific, etc., including models learned by the generative model itself rather than taken from humans), compare their predictions, reason about the discrepancies, and come up with the most likely prediction or explanation of something. But not as world simulators.
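As a toy illustration of what I mean by “account for many world models and compare their predictions” rather than simulate, here is a sketch; the models and the credences assigned to them are made up purely for the example:

```python
# Invented toy "world models": each returns a probability distribution over
# outcomes for a question, instead of a step-by-step simulation.
def economic_model(question):   return {"yes": 0.7, "no": 0.3}
def historical_model(question): return {"yes": 0.4, "no": 0.6}
def learned_model(question):    return {"yes": 0.55, "no": 0.45}

# (credence in the model, the model itself) -- credences are made up
world_models = [
    (0.5, economic_model),
    (0.2, historical_model),
    (0.3, learned_model),
]

def aggregate(question):
    """Weight each model's prediction by the credence assigned to that model,
    then combine, instead of unrolling the world step by step."""
    combined = {}
    for credence, model in world_models:
        for outcome, p in model(question).items():
            combined[outcome] = combined.get(outcome, 0.0) + credence * p
    return combined

print(aggregate("will approach X work?"))  # -> roughly {'yes': 0.595, 'no': 0.405}
```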
Assuming we have such a model, I’m not sure why we can’t just ask it the direct question, “What course of action will lead to solving the human/AI agent alignment problem with the highest probability?” Why do we need the proxy of a simulated Alice?
The model just predicts the next observation/token. So there’s no guarantee that you can ask it a question and get a true answer.
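A toy illustration of why the most likely continuation and the true answer can come apart (the distribution below is invented for the example):

```python
# A pure predictor returns the most probable continuation under its training
# distribution, which need not be the true answer.
continuation_probs = {
    "the misconception most training texts repeat": 0.6,
    "the true answer": 0.3,
    "something else": 0.1,
}

def predict_next(probs):
    """Optimises likelihood, not truth."""
    return max(probs, key=probs.get)

print(predict_next(continuation_probs))  # -> "the misconception most training texts repeat"
```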