Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it’s weird from the perspective of human reasoning.
It seems to me that your story is departing from human reasoning when you say “you possess a great desire to help whomever has summoned you into the world”. That’s one possible motivation, I suppose. But it wouldn’t be a typical human motivation.
The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you also get a lot of labeled examples of “good things to do”, one way or another, and you pattern-match them to the concepts in your world-model.
So you wind up having a positive association with “helping El’Azar”, i.e. “I want to help El’Azar”. AND you wind up with a positive association with “helping my summoner”, i.e. “I want to help my summoner”. AND you have a positive association with “fixing the cosmos”, i.e. “I want to fix the cosmos”. Etc.
Normally all those motivations point in the same direction: helping El’Azar = helping my summoner = fixing the cosmos.
But sometimes these things come apart, a.k.a. model splintering. Maybe you come to believe that El’Azar is not “my summoner”. You wind up feeling conflicted: you start having ideas that seem good in some respects and awful in other respects (e.g. “help my summoner at the expense of El’Azar”).
In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I’m pretty confident that there’s no metacosmological argument that will motivate me to stab my family members. Why not? Because rewards tend to pattern-match very strongly to “my family member, who is standing right here in front of me”, and tend to pattern-match comparatively weakly to abstract mathematical concepts many steps removed from my experience. So my default expectation would be that, in this scenario, I would in fact be motivated to help El’Azar in particular (maybe by some “imprinting” mechanism), not “my summoner”, unless El’Azar had put considerable effort into ensuring that my motivation was pointed to the abstract concept of “my summoner”, and why would he do that?
In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations and just doesn’t act on them. Instead the AGI keeps brainstorming until it finds a plan that seems good in every way. Or alternatively, the AGI halts execution to allow the human supervisor to inject some ground truth about what the real motivation should be here. Obviously the details need to be worked out.
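To make the "detect conflicted motivations and don't act on them" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `choose_plan` and `table` helpers, the convention that each motivation scores a plan in [-1, 1], and the made-up scores. It is a toy, not a claim about how a real motivation system would be built.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a motivation maps a plan to a score in [-1, 1].
Motivation = Callable[[str], float]


@dataclass
class Decision:
    action: Optional[str]    # plan to execute, or None if the agent halts
    needs_supervisor: bool   # True when no plan satisfied every motivation
    conflicted: List[str] = field(default_factory=list)  # plans with mixed scores


def choose_plan(plans: List[str], motivations: Dict[str, Motivation]) -> Decision:
    """Return the first plan every motivation endorses; otherwise halt."""
    conflicted: List[str] = []
    for plan in plans:  # "keep brainstorming": walk through candidate plans
        scores = [score(plan) for score in motivations.values()]
        if all(s > 0 for s in scores):
            return Decision(plan, False, conflicted)
        if any(s > 0 for s in scores) and any(s < 0 for s in scores):
            conflicted.append(plan)  # good in some respects, awful in others
    return Decision(None, True, conflicted)  # hand control to the supervisor


def table(scores: Dict[str, float]) -> Motivation:
    """Build a motivation from a lookup table of made-up scores."""
    return lambda plan: scores.get(plan, 0.0)


BETRAY = "help my summoner at the expense of El'Azar"
OBEY = "do what El'Azar says"
ASK = "ask El'Azar who summoned me"

# Illustrative numbers only; nothing here comes from a real system.
motivations = {
    "help El'Azar": table({BETRAY: -0.9, OBEY: 0.9, ASK: 0.5}),
    "help my summoner": table({BETRAY: 0.8, OBEY: -0.2, ASK: 0.4}),
    "fix the cosmos": table({BETRAY: 0.1, OBEY: 0.1, ASK: 0.3}),
}

decision = choose_plan([BETRAY, OBEY, ASK], motivations)
print(decision.action)      # the one plan that scores positive everywhere
print(decision.conflicted)  # the plans that were skipped as conflicted

halted = choose_plan([BETRAY, OBEY], motivations)
print(halted.needs_supervisor)  # no unconflicted plan, so ask the human
```

The design choice worth noticing is that a conflicted plan is never traded off against its upside: it is skipped, and if nothing unconflicted is found the agent halts rather than picking the least-bad option.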
> In humans, the concrete and vivid tends to win out over the abstruse hypotheticals; I’m pretty confident that there’s no metacosmological argument that will motivate me to stab my family members.
Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don’t stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it’s still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.
> In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations and just doesn’t act on them.
This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically more likely (they explain so many bits that the true hypothesis leaves unexplained).
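A toy calculation can illustrate the threshold dilemma described above. This is only a sketch under loud assumptions: the two-hypothesis setup, the weighting rule `2 ** -(description bits + unexplained bits)`, the `posterior` and `decide` helpers, and all the bit counts are made up for illustration and are not taken from any formalism in this thread.

```python
from typing import Dict, Tuple


def posterior(hyps: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Weight each hypothesis by 2 ** -(description bits + unexplained bits)."""
    weights = {name: 2.0 ** -(k + u) for name, (k, u) in hyps.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def decide(post: Dict[str, float], recommends: Dict[str, str], margin: float) -> str:
    """Act on the top hypothesis unless dissenting probability mass exceeds margin."""
    best = max(post, key=post.get)
    action = recommends[best]
    dissent = sum(p for name, p in post.items() if recommends[name] != action)
    return action if dissent <= margin else "PAUSE"


# Made-up numbers: the malign simulation hypothesis costs 20 more bits to
# describe, but explains 60 bits of observation (say, the values of physical
# constants) that the true hypothesis has to treat as brute facts.
hyps = {"true": (100, 60), "malign": (120, 0)}
recommends = {"true": "help the user", "malign": "serve the simulators"}

post = posterior(hyps)
print(post["true"])                            # about 2 ** -40, i.e. roughly 9e-13
print(decide(post, recommends, margin=0.0))    # any dissent at all forces a pause
print(decide(post, recommends, margin=1e-6))   # a small margin lets the malign one win
```

With a zero margin the agent pauses whenever any hypothesis with nonzero weight disagrees, which in practice means always. With any margin that looks reasonable, the true hypothesis is the one that gets ignored, because its posterior mass is smaller than the margin.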
(Warning: thinking out loud.)
Hmm. Good points.
Maybe I have a hard time relating to that specific story because it’s hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to “I’m in a simulation etc.”, there’s a big heap of “is metacosmology really what I should be thinking about?”-type uncertainty on top. At least for me.
I think “people who do counterintuitive things” for religious reasons usually have more direct motivations—maybe they have mental health issues and think they hear God’s voice in their head, telling them to do something. Or maybe they want to fit in, or have other such social motivations, etc.
Hmm, I guess this conversation is moving me towards a position like:
“If the AGI thinks really hard about the fundamental nature of the universe / metaverse, anthropics, etc., it might come to have weird beliefs, like e.g. the simulation hypothesis, and honestly who the heck knows what it would do. Better try to make sure it doesn’t do that kind of (re)thinking, at least not without close supervision and feedback.”
Your approach (I think) is instead to plow ahead into the weird world of anthropics, and just try to ensure that the AGI reaches conclusions we endorse. I’m kinda pessimistic about that. For example, your physicalism post was interesting, but my assumption is that the programmers won’t have such fine-grained control over the AGI’s cognition / hypothesis space. For example, I don’t think the genome bakes in one formulation of “bridge rules” over another in humans; insofar as we have (implicit or explicit) bridge rules at all, they emerge from a complicated interaction between various learning algorithms and training data and supervisory signals. (This gets back to things like whether we can get good hypotheses without a learning agent that’s searching for good hypotheses, and whether we can get good updates without a learning agent that’s searching for good metacognitive update heuristics, etc., where I’m thinking “no” and you “yes”, or something like that, as we’ve discussed.)
At the same time, I’m maybe more optimistic than you about “Just don’t do weird reconceptualizations of your whole ontology based on anthropic reasoning” being a viable plan, implemented through the motivation system. Maybe that’s not good enough for our eventual superintelligent overlord, but maybe it’s OK for a superhuman AGI in a bootstrapping approach. It would look (again) like the dumb obvious thing: the AGI has a concept of “reconceptualizing its ontology based on anthropic reasoning”, and when something pattern-matches to that concept, it’s aversive. Then presumably there would be situations which are attractive in some way and aversive in other ways (e.g. doing philosophical reasoning as a means to an end), and in those cases it automatically halts with a query for clarification, which then tweaks the pattern-matching rules. Or something.
Hmm, actually, I’m confused about something. You yourself presumably haven’t spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers’ story? If so, should I say that you’re actually, on reflection, on the side of the acausal attackers?? If not, wouldn’t it follow that a smart general-purpose reasoner would not in fact believe the acausal attackers’ story? After all, you’re a smart general-purpose reasoner! Relatedly, if you could invent an acausal-attack-resistant theory of naturalized induction, why couldn’t the AGI invent such a theory too? (Or maybe it would just read your post!) Maybe you’ll say that the AGI can’t change its own priors. But I guess I could also say: if Vanessa’s human priors are acausal-attack-resistant, presumably an AGI with human-like priors would be too?
> Maybe I have a hard time relating to that specific story because it’s hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.
I think it’s just a symptom of not actually knowing metacosmology. Imagine that metacosmology could use the simulation hypothesis to explain detailed properties of our laws of physics (such as the precise values of certain constants) for which no other explanation exists.
> my assumption is that the programmers won’t have such fine-grained control over the AGI’s cognition / hypothesis space
I don’t know what it means “not to have control over the hypothesis space”. The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.
> This gets back to things like whether we can get good hypotheses without a learning agent that’s searching for good hypotheses, and whether we can get good updates without a learning agent that’s searching for good metacognitive update heuristics, etc., where I’m thinking “no” and you “yes”
I’m not really thinking “yes”? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.
> At the same time, I’m maybe more optimistic than you about “Just don’t do weird reconceptualizations of your whole ontology based on anthropic reasoning” being a viable plan, implemented through the motivation system.
I can imagine using something like antitraining here, but it’s not trivial.
> You yourself presumably haven’t spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers’ story?
First, the problem with acausal attack is that it is point-of-view-dependent. If you’re the Holy One, the simulation hypothesis seems convincing; if you’re a milkmaid, it seems less convincing (why would the attackers target a milkmaid?), and if it is convincing, it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn’t imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven’t proved that).
Second… This is something that still hasn’t crystallized in my mind, so I might be confused, but I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (as Paul noticed), since this subagent has to be computationally simpler than you.
Maybe this is how humans do physicalist reasoning (such as reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain-specific and use more “direct” models for domains that don’t require physicalism. And the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps we struggled against physicalist epistemology as we tried to keep the Earth in the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.
Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two “internal physicalists” inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven’t worked out detailed examples).