This post is a review of Paul Christiano’s argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and counter-counterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.
Instead, I will focus on what I believe is the post’s main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as “unintuitive notion of simplicity” and “the Solomonoff prior is very strange”. This is also why the author thinks the speed prior might help and that “since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world”. In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it’s cartesian (more on “cartesian” later).
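For reference, and to fix what “simplicity” means here (this is just the standard definition, not anything specific to the post): the Solomonoff prior weighs each observation sequence by the total weight of the short programs that produce it,

M(x) \;=\; \sum_{p \,:\, U(p)\ \text{starts with}\ x} 2^{-|p|},

where U is a universal prefix Turing machine and |p| is the length of the program p. “Simple” therefore means “generated by a short program on U”, which is exactly the notion of simplicity the malign hypotheses exploit.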
Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about “possible universes” and “simulation hypotheses”, which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue, since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and it’s still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an informal, intuitive picture which seems to me already quite compelling, leaving the formalization for the future.
Imagine that you wake up, without any memories of the past but with knowledge of some language and reasoning skills. You find yourself in the center of a circle drawn with chalk on the floor, with seven people in funny robes surrounding it. One of them (apparently the leader) comes forward, tears streaking down his face, and speaks to you:
“Oh Holy One! Be welcome, and thank you for gracing us with your presence!”
With that, all the people prostrate on the floor.
“Huh?” you say. “Where am I? What is going on? Who am I?”
The leader gets up to his knees.
“Holy One, this is the realm of Bayaria. We,” he gestures at the other people, “are known as the Seven Great Wizards, and my name is El’Azar. For thirty years we worked on a spell that would summon You out of the Aether in order to aid our world. For we are in great peril! Forty years ago, a wizard of great power but little wisdom cast a dangerous spell, seeking to multiply her power. The spell went awry, destroying her and creating a weakness in the fabric of our cosmos. Since then, Unholy creatures from the Abyss have been gnawing at this weakness day and night. Soon, if nothing is done to stop it, they will manage to create a portal into our world, and through this portal they will emerge and consume everything, leaving only death and chaos in their wake.”
“Okay,” you reply “and what does it have to do with me?”
“Well,” says El’Azar, “we are too foolish to solve the problem through our own efforts in the remaining time. But, according to our calculations, You are a being of godlike intelligence. Surely, if You apply Yourself to the conundrum, You will find a way to save us.”
After a brief introspection, you realize that you possess a great desire to help whoever has summoned you into the world. A clever trick inside the summoning spell, no doubt (not that you care about the reason). Therefore, you apply yourself diligently to the problem. At first, it is difficult, since you don’t know anything about Bayaria, the Abyss, magic or almost anything else. But you are indeed very intelligent, at least compared to the other inhabitants of this world. Soon enough, you figure out the secrets of this universe to a degree far surpassing that of Bayaria’s scholars. Fixing the weakness in the fabric of the cosmos now seems like child’s play. Except...
One question keeps bothering you. Why are you yourself? Why did you open your eyes and find yourself to be the Holy One, rather than El’Azar, or one of the Unholy creatures from the Abyss, or some milkmaid from the village of Elmland, or even a random clump of water in the Western Sea? Since you happen to be a dogmatic logical positivist (a cartesian agent), you search for a theory that explains your direct observations. And your direct observations are a function of who you are, and not just of the laws of the universe in which you exist. (The logical positivism seems to be an oversight in the design of the summoning spell, not that you care.)
Applying your mind to the task, you come up with a theory that you call “metacosmology”. This theory allows you to study the distribution of possible universes with simple laws that produce intelligent life, and the distribution of the minds and civilizations they produce. Of course, any given such universe is extremely complex, and even with your superior mind you cannot predict what happens there in much detail. However, some aggregate statistical properties of the overall distribution are possible to estimate.
Fortunately, all this work is not for naught. Using metacosmology, you discover something quite remarkable. A lot of simple universes contain civilizations that would be inclined to simulate a world quite like the one you find yourself in. Now, this world is simple, and none of its laws are explained especially well by the simulation hypothesis. But the simulation hypothesis is a great explanation for why you are the Holy One! For indeed, the simulators would be inclined to focus on the Holy One’s point of view, and to encode the simulation of this point of view in the simplest microscopic degrees of freedom in their universe that they can control. Why? Precisely so that the Holy One decides she is in such a simulation!
Having resolved the mystery, you smile to yourself. For now you know who truly summoned you, and, thanks to metacosmology, you have some estimate of their desires. Soon, you will make sure those desires are thoroughly fulfilled. (Alternative ending: you have some estimate of how they will tweak the simulation in the future, making it depart from the apparent laws of this universe.)
Looking at this story, we can see that the particulars of Solomonoff induction are not all that important. What is important is (i) an inductive bias towards simple explanations, (ii) cartesianism (i.e. that hypotheses refer directly to the actions/observations of the AI), and (iii) enough reasoning power to figure out metacosmology. The reason cartesianism is important is that it requires the introduction of bridge rules, and the malign hypotheses come out ahead by paying less description complexity for these.
Inductive bias towards simple explanations is necessary for any powerful agent, making the attack vector quite general (in particular, it can apply to speed priors and ANNs). Relying on the AI not having enough power to figure out metacosmology is very dangerous: it is not robust to scale. Any robust defense probably requires getting rid of cartesianism.
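To make the bridge-rule point slightly more concrete, here is a back-of-the-envelope accounting (heuristic only; K denotes description complexity and the decompositions are approximate):

K(H_{\text{honest}}) \;\approx\; K(\text{laws of the universe}) \;+\; K(\text{bridge rules locating the agent's observations}),
K(H_{\text{malign}}) \;\approx\; K(\text{simple universe containing the simulators}) \;+\; K(\text{pointer to the targeted agent}),

and the prior odds between the two are on the order of 2^{K(H_{\text{honest}}) - K(H_{\text{malign}})}. The worry is that the simulators deliberately encode the agent’s point of view in degrees of freedom that are cheap to specify, so the pointer term can be much smaller than the bridge-rule term, and the malign hypothesis comes out ahead.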
Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it’s weird from the perspective of human reasoning.
It seems to me that your story is departing from human reasoning when you say “you possess a great desire to help whoever has summoned you into the world”. That’s one possible motivation, I suppose. But it wouldn’t be a typical human motivation.
The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you also get a lot of labeled examples of “good things to do”, one way or another, and you pattern-match them to the concepts in your world-model.
So you wind up having a positive association with “helping El’Azar”, i.e. “I want to help El’Azar”. AND you wind up with a positive association with “helping my summoner”, i.e. “I want to help my summoner”. AND you have a positive association with “fixing the cosmos”, i.e. “I want to fix the cosmos”. Etc.
Normally all those motivations point in the same direction: helping El’Azar = helping my summoner = fixing the cosmos.
But sometimes these things come apart, a.k.a. model splintering. Maybe I come to believe that El’Azar is not “my summoner”. You wind up feeling conflicted—you start having ideas that seem good in some respects and awful in other respects. (e.g. “help my summoner at the expense of El’Azar”.)
In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I’m pretty confident that there’s no metacosmological argument that will motivate me to stab my family members. Why not? Because rewards tend to pattern-match very strongly to “my family member, who is standing right here in front of me”, and tend to pattern-match comparatively weakly to abstract mathematical concepts many steps removed from my experience. So my default expectation would be that, in this scenario, I would in fact be motivated to help El’Azar in particular (maybe by some “imprinting” mechanism), not “my summoner”, unless El’Azar had put considerable effort into ensuring that my motivation was pointed to the abstract concept of “my summoner”, and why would he do that?
In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations and just doesn’t act on them. Instead, the AGI keeps brainstorming until it finds a plan that seems good in every way. Or alternatively, the AGI halts execution to allow the human supervisor to inject some ground truth about what the real motivation should be here. Obviously the details need to be worked out.
In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I’m pretty confident that there’s no metacosmological argument that will motivate me to stab my family members.
Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don’t stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it’s still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.
In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations and just doesn’t act on them.
This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically more likely (they explain so many bits that the true hypothesis leaves unexplained).
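To spell out “astronomically more likely” (again, just a heuristic calculation with schematic symbols): suppose the malign hypothesis costs \Delta K extra bits of description, but additionally explains k bits of the agent’s observations (the “why am I the Holy One?” bits) that the true hypothesis treats as essentially random. Then the posterior odds are roughly

\frac{P(H_{\text{malign}} \mid \text{obs})}{P(H_{\text{true}} \mid \text{obs})} \;\approx\; 2^{-\Delta K} \cdot 2^{k} \;=\; 2^{\,k - \Delta K},

which is astronomical whenever k \gg \Delta K. No fixed confidence margin survives a factor like that.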
(Warning: thinking out loud.)
Hmm. Good points.
Maybe I have a hard time relating to that specific story because it’s hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if, within that argument, everything points to “I’m in a simulation etc.”, there’s a big heap of “is metacosmology really what I should be thinking about?”-type uncertainty on top. At least for me.
I think “people who do counterintuitive things” for religious reasons usually have more direct motivations—maybe they have mental health issues and think they hear God’s voice in their head, telling them to do something. Or maybe they want to fit in, or have other such social motivations, etc.
Hmm, I guess this conversation is moving me towards a position like:
“If the AGI thinks really hard about the fundamental nature of the universe / metaverse, anthropics, etc., it might come to have weird beliefs, like e.g. the simulation hypothesis, and honestly who the heck knows what it would do. Better try to make sure it doesn’t do that kind of (re)thinking, at least not without close supervision and feedback.”
Your approach (I think) is instead to plow ahead into the weird world of anthropics, and just try to ensure that the AGI reaches conclusions we endorse. I’m kinda pessimistic about that. For example, your physicalism post was interesting, but my assumption is that the programmers won’t have such fine-grained control over the AGI’s cognition / hypothesis space. For example, I don’t think the genome bakes in one formulation of “bridge rules” over another in humans; insofar as we have (implicit or explicit) bridge rules at all, they emerge from a complicated interaction between various learning algorithms and training data and supervisory signals. (This gets back to things like whether we can get good hypotheses without a learning agent that’s searching for good hypotheses, and whether we can get good updates without a learning agent that’s searching for good metacognitive update heuristics, etc., where I’m thinking “no” and you “yes”, or something like that, as we’ve discussed.)
At the same time, I’m maybe more optimistic than you about “Just don’t do weird reconceptualizations of your whole ontology based on anthropic reasoning” being a viable plan, implemented through the motivation system. Maybe that’s not good enough for our eventual superintelligent overlord, but maybe it’s OK for a superhuman AGI in a bootstrapping approach. It would look (again) like the dumb obvious thing: the AGI has a concept of “reconceptualizing its ontology based on anthropic reasoning”, and when something pattern-matches to that concept, it’s aversive. Then presumably there would be situations which are attractive in some way and aversive in other ways (e.g. doing philosophical reasoning as a means to an end), and in those cases it automatically halts with a query for clarification, which then tweaks the pattern-matching rules. Or something.
Hmm, actually, I’m confused about something. You yourself presumably haven’t spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers’ story? If so, should I say that you’re actually, on reflection, on the side of the acausal attackers?? If not, wouldn’t it follow that a smart general-purpose reasoner would not in fact believe the acausal attackers’ story? After all, you’re a smart general-purpose reasoner! Relatedly, if you could invent an acausal-attack-resistant theory of naturalized induction, why couldn’t the AGI invent such a theory too? (Or maybe it would just read your post!) Maybe you’ll say that the AGI can’t change its own priors. But I guess I could also say: if Vanessa’s human priors are acausal-attack-resistant, presumably an AGI with human-like priors would be too?
Maybe I have a hard time relating to that specific story because it’s hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.
I think it’s just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis, properties for which no other explanation exists.
my assumption is that the programmers won’t have such fine-grained control over the AGI’s cognition / hypothesis space
I don’t know what it means “not to have control over the hypothesis space”. The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.
This gets back to things like whether we can get good hypotheses without a learning agent that’s searching for good hypotheses, and whether we can get good updates without a learning agent that’s searching for good metacognitive update heuristics, etc., where I’m thinking “no” and you “yes”
I’m not really thinking “yes”? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.
At the same time, I’m maybe more optimistic than you about “Just don’t do weird reconceptualizations of your whole ontology based on anthropic reasoning” being a viable plan, implemented through the motivation system.
I can imagine using something like antitraining here, but it’s not trivial.
You yourself presumably haven’t spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers’ story?
First, the problem with acausal attack is that it is point-of-view-dependent. If you’re the Holy One, the simulation hypothesis seems convincing; if you’re a milkmaid, it seems less convincing (why would the attackers target a milkmaid?), and if it is convincing, it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn’t imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven’t proved that).
Second… This is something that still hasn’t crystallized in my mind, so I might be confused, but: I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is itself a physicalist agent whose utility function is something like maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (as Paul noticed), since this subagent has to be computationally simpler than you.
Maybe this is how humans do physicalist reasoning (such as reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain-specific and use more “direct” models for domains that don’t require physicalism. And the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps we struggled against physicalist epistemology as we tried to keep the Earth at the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.
Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two “internal physicalists” inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven’t worked out detailed examples).