This is great, thanks again for your time and thoughtful commentary!
RE “I’m not entirely convinced that predictions should be made in a way that’s completely divorced from their effects on the world.”: My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don’t claim that this is definitely the One Right Answer To AGI Safety (see “4. Give up, and just make an agent with value-aligned goals” in the post), but I think it is a plausible (and neglected) candidate answer. See also my post “In defense of oracle (tool) AI research” for why I think it would solve the AGI safety problem.
If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it’s a goal-seeking agent. In this case, we need to solve value alignment, even if the goal is as simple as “answer my question”. (We would need to make sure that the goal is what it’s supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator’s brain counts as “answer my question”.) Again, I’m not opposed to building agents after solving value alignment, but we haven’t solved value alignment yet, and thus it’s worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or, if it does model them incidentally, does not do anything with that information).
Interfacing with a non-agential AGI is generally awkward. You can’t directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like “If there were no AGIs in the world, what’s the likeliest way that a person would find a cure for Alzheimer’s?” This type of question does not require the AGI to think through the consequences of its output, and it also has other nice properties (it should give answers that are less weird, alien, and human-unfriendly than the solutions a direct goal-seeking agent would find).
OK, that’s my grand vision and motivation, and why I’m hoping for “no reasoning about the consequences of one’s output whatsoever”, as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one’s outputs is OK, but I’m nervous.)
Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I’m not sure, and I keep changing my mind. The answer may also differ depending on the details of the algorithm.
My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X’s that result in high P(Y).
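Here is a toy sketch of how that could happen. It assumes a hypothetical consistency heuristic: since X implies Y, coherence requires P(X) ≤ P(Y), so the system clamps its score for X down to P(Y). The candidate names, the probabilities, and the `adjusted_score` helper are all made up for illustration; I'm not claiming any real predictor works this way.

```python
# Toy illustration only: the numbers and the clamping heuristic are assumptions.

def adjusted_score(p_x: float, p_y_implied: float) -> float:
    """Clamp P(X) so it never exceeds P(Y) when X implies Y.

    If X implies Y, any coherent probability assignment has P(X) <= P(Y).
    A system enforcing this would push P(X) down whenever P(Y) is low.
    """
    return min(p_x, p_y_implied)


# Each candidate output X comes with the probability the model assigns to the
# follow-up input Y that X would lead to.
candidates = {
    "X_surprising": {"p_x": 0.5, "p_y": 0.1},  # leads to an atypical input
    "X_typical": {"p_x": 0.4, "p_y": 0.9},     # leads to a typical input
}

# Choice if the system ignores the downstream input Y entirely.
raw_choice = max(candidates, key=lambda name: candidates[name]["p_x"])

# Choice after the clamp: X's that lead to high-probability Y's are favored.
adjusted_choice = max(
    candidates,
    key=lambda name: adjusted_score(
        candidates[name]["p_x"], candidates[name]["p_y"]
    ),
)

print(raw_choice, adjusted_choice)
```

In this sketch the unclamped system prefers `X_surprising`, while the clamped one switches to `X_typical`, purely because of what each output leads to next.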
My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn’t happen, as suggested by interstice’s comments on this page.) So if X1 leads to one of 500 slightly different Y1’s (Y1a, Y1b, ...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1’s in aggregate are likelier than Y2; so X2 is at an unfair advantage.
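The X1/X2 example can be checked with a few lines of arithmetic. This is a minimal sketch with made-up numbers (X1 at 0.6 total, split evenly over 500 Y1 variants; X2 at 0.4 on a single Y2); the function names are hypothetical and only serve the illustration.

```python
import random

# Assumed toy numbers: X1 is likelier overall, but its mass is spread thin.
P_X1, P_X2 = 0.6, 0.4
N_VARIANTS = 500

# Joint probability of each complete (X, Y) context.
contexts = {("X1", f"Y1_{i}"): P_X1 / N_VARIANTS for i in range(N_VARIANTS)}
contexts[("X2", "Y2")] = P_X2


def pick_single_most_likely_context(ctx):
    """Return the X belonging to the single most probable (X, Y) context."""
    (x, _y), _p = max(ctx.items(), key=lambda item: item[1])
    return x


def marginal_over_x(ctx):
    """Sum the probability of every context that shares the same X."""
    totals = {}
    for (x, _y), p in ctx.items():
        totals[x] = totals.get(x, 0.0) + p
    return totals


map_choice = pick_single_most_likely_context(contexts)
marginals = marginal_over_x(contexts)
marginal_choice = max(marginals, key=marginals.get)

# Alternative: draw contexts at random in proportion to their probability.
rng = random.Random(0)
keys = list(contexts)
weights = [contexts[k] for k in keys]
draws = rng.choices(keys, weights=weights, k=10_000)
frac_x1 = sum(1 for x, _y in draws if x == "X1") / len(draws)

print(map_choice, marginal_choice, round(frac_x1, 2))
```

Picking the single most likely context selects X2 (0.4 beats each individual Y1 variant at 0.0012), even though X1 has the larger total probability. Random draws from the full distribution land on X1 roughly 60% of the time, so the bias towards low-entropy outcomes does not appear in that case.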
Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.