The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven’t made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven’t found any new problems with the framework.
The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured “attractor submanifold” in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.
One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know of at present for solving alignment, and an important (informal) demonstration of the potential of the IBP formalism.
fwiw that strange philosophical bullet fits remarkably well with a set of thoughts I had while reading Anthropic Bias about ‘amount of existence’ being the fundamental currency of reality (a bunch of the anthropic paradoxes felt like they were showing that if you traded sufficiently large amounts of “patterns like me exist more” then you could get counterintuitive results like bending the probabilities of the world around you without any causal pathway), and infraBayes requiring it actually updated me a little towards infraBayes being on the right track.
My model of why humans seem to prefer non-existence to existence in some cases is that our ancestors faced situations which could reduce their ability to self-propagate to almost zero, and needed to avoid these really hard. Evolution gave us training signals which can easily generate subagents which are single-mindedly obsessed with avoiding certain kinds of intense suffering. This motivates us to avoid a wide range of realistic things which cost us existence, but, as a side effect of being emphasized so much, it makes it possible to tip into suicidality in cases where, in our history, this was not too costly because things were bad enough anyway that the agent wouldn’t propagate much (suicide in situations where the cues indicated that self-propagation was relatively likely should have been weeded out for on-distribution humans). This strikes me as unintended and a result of a hack which works pretty well on-distribution, and likely not reflectively consistent in the limit. An evolution which could generate brains with unbounded compute would not make agents which ever preferred suicide or non-existence.
Another angle on this is to think of evolution as having set things up for a sign-flipped subagent to be reinforced, one which just wants to Not Be. This is not a natural shape for an agent to be, but it’s useful enough that the pattern to generate it is common.
This is all pretty handwave-y and I don’t claim high confidence that it’s correct or useful, but it might be interesting babble.