Maybe I have a hard time relating to that specific story because it’s hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.
I think it’s just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis, with no other explanation available for those properties.
my assumption is that the programmers won’t have such fine-grained control over the AGI’s cognition / hypothesis space
I don’t know what it means “not to have control over the hypothesis space”. The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.
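To make this concrete, here is a toy sketch (purely illustrative, not any real proposal): a Bayesian predictor whose code enumerates a fixed hypothesis class. A hypothesis the programmers never wrote down can never gain posterior weight, no matter what data arrives; in that sense, the code is the hypothesis space.

```python
# Toy Bayesian predictor over a hard-coded hypothesis class.
# The hypothesis space is exactly what this list contains: a hypothesis
# the programmers never wrote down can never gain posterior weight.

def make_coin(p):
    """Hypothesis: i.i.d. coin with bias p. Returns P(obs | hypothesis)."""
    return lambda obs: p if obs == 1 else 1.0 - p

hypotheses = [make_coin(p) for p in (0.1, 0.5, 0.9)]   # the whole space
posterior = [1.0 / len(hypotheses)] * len(hypotheses)  # uniform prior

def update(posterior, obs):
    """One Bayes update; anything with zero weight stays at zero forever."""
    unnorm = [w * h(obs) for w, h in zip(posterior, hypotheses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

for obs in [1, 1, 1, 0, 1, 1]:  # data favoring the heads-biased coin
    posterior = update(posterior, obs)

# The p=0.9 coin now dominates. But if the true process were, say, a
# deterministic alternating sequence, no amount of data would introduce
# that hypothesis: the code fixed the space in advance.
print(posterior)
```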
This gets back to things like whether we can get good hypotheses without a learning agent that’s searching for good hypotheses, and whether we can get good updates without a learning agent that’s searching for good metacognitive update heuristics, etc., where I’m thinking “no” and you’re thinking “yes”.
I’m not really thinking “yes”? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.
At the same time, I’m maybe more optimistic than you about “Just don’t do weird reconceptualizations of your whole ontology based on anthropic reasoning” being a viable plan, implemented through the motivation system.
I can imagine using something like antitraining here, but it’s not trivial.
You yourself presumably haven’t spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers’ story?
First, the problem with acausal attack is that it is point-of-view-dependent. If you’re the Holy One, the simulation hypothesis seems convincing; if you’re a milkmaid, it seems less convincing (why would the attackers target a milkmaid?); and if it is convincing, it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn’t follow that they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees convergence to the same beliefs (although I haven’t proved that).
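A toy Bayes calculation (with entirely made-up numbers) illustrates the point-of-view dependence: if attackers preferentially simulate pivotal observers, the same prior on being simulated yields a near-certain posterior for the Holy One and a negligible one for the milkmaid.

```python
# Toy Bayes calculation (all numbers invented) of point-of-view dependence.
# Assumption: acausal attackers preferentially simulate pivotal observers,
# so most simulated observers are influential and almost none are milkmaids,
# while in the base universe the proportions are reversed.

prior_sim = 0.01  # shared prior on being inside an attacker's simulation

def posterior_simulated(p_role_given_sim, p_role_given_base):
    """P(simulated | my role), by Bayes' rule."""
    num = p_role_given_sim * prior_sim
    den = num + p_role_given_base * (1.0 - prior_sim)
    return num / den

holy_one = posterior_simulated(p_role_given_sim=0.9, p_role_given_base=1e-6)
milkmaid = posterior_simulated(p_role_given_sim=1e-4, p_role_given_base=0.1)

print(holy_one)  # near 1: from this seat the simulation story is convincing
print(milkmaid)  # near 0: from this seat the same story is not
```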
Second… this is something that still hasn’t crystallized in my mind, so I might be confused, but here goes. I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (as Paul noticed), since this subagent has to be computationally simpler than you.
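A minimal sketch of the construction, under strong simplifying assumptions: the outer learner is an ordinary Bayesian mixture, and one of its “hypotheses” is itself an adaptive agent whose effective goal is to retain posterior weight. Here the inner agent just tracks empirical frequencies, which is one simple way to keep its cumulative likelihood competitive; nothing about it is anyone’s actual proposal.

```python
from collections import Counter

class InnerAgent:
    """A cartesian 'hypothesis' that is itself an agent: it adapts its
    next-symbol probabilities to the stream seen so far, so that its
    cumulative likelihood (and hence its posterior weight in the outer
    mixture) stays competitive without knowing the true source."""

    def __init__(self):
        self.counts = Counter({0: 1, 1: 1})  # Laplace smoothing

    def prob(self, obs):
        return self.counts[obs] / sum(self.counts.values())

    def observe(self, obs):
        self.counts[obs] += 1

def make_coin(p):
    """An ordinary fixed hypothesis: i.i.d. coin with bias p."""
    return lambda obs: p if obs == 1 else 1.0 - p

agent = InnerAgent()
fixed = [make_coin(0.5), make_coin(0.9)]
weights = [1 / 3, 1 / 3, 1 / 3]  # [agent, coin(0.5), coin(0.9)]

for obs in [1, 0, 1, 1, 0, 1, 1, 1]:
    likes = [agent.prob(obs)] + [h(obs) for h in fixed]
    weights = [w * l for w, l in zip(weights, likes)]
    total = sum(weights)
    weights = [w / total for w in weights]
    agent.observe(obs)

# The adaptive "hypothesis" tracks empirical frequencies and keeps a
# substantial share of the posterior alongside the fixed hypotheses.
print(weights)
```

Note the sketch ignores the computational-simplicity constraint mentioned above; it only illustrates how an adaptive subagent can persist inside a Bayesian mixture by keeping its likelihood high.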
Maybe this is how humans do physicalist reasoning (such as reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain-specific and use more “direct” models for domains that don’t require physicalism. The cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps we struggled against physicalist epistemology as we tried to keep the Earth at the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.
Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two “internal physicalists” inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven’t worked out detailed examples).