However, simplicity is not enough to, e.g., select among all the possible objectives which might explain the behavior of a human. Quite bluntly, Occam’s razor is insufficient to infer the preferences of irrational agents. Humans (i.e. irrational agents) appear not to actually value the simplest thing which would explain their behavior. For another failure mode, consider the notion of an acausal attack highlighted by Vanessa Kosoy here, or framed in terms of the fact that the Solomonoff prior is malign. In this scenario, an AGI incentivized to keep things simple might conclude that the shortest explanation of how the universe works is “[The Great Old One] has been running all possible worlds based on all possible physics, including us.” This inference, the argument goes, might incentivize the AGI to defer to The Great Old One as its causal precursor, thereby ignoring us in the process. Entertaining related ideas with a sprinkle of anthropics has a tendency to get you into the unproductive state of wondering whether you’re in the middle of a computation being run by a language model on another plane of existence which is being prompted to generate a John Wentworth post.
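To make the first point concrete, here is a toy sketch (my own illustration, not anyone's actual method) of why a bare simplicity prior fails to pick out the intended (planner, reward) decomposition of observed behavior: several decompositions fit the data equally well, and the degenerate ones are not obviously more complex. Description length is crudely approximated here by compressed string length, and the policy and candidate hypotheses are invented purely for the example.

```python
# Toy illustration: a bare simplicity prior struggles to pick the "intended"
# (planner, reward) decomposition of behavior. Description length is crudely
# approximated by zlib-compressed byte length; everything below is made up
# for illustration only.
import zlib

# Observed behavior: the "human" picks action 1 in every state (a stand-in policy
# that every candidate decomposition below reproduces exactly).
observed_policy = {state: 1 for state in range(10)}

def description_length(hypothesis: str) -> int:
    """Crude stand-in for Kolmogorov complexity: compressed byte length."""
    return len(zlib.compress(hypothesis.encode()))

# Candidate (planner, reward) decompositions that all explain the same policy.
candidates = {
    "intended":      "planner=noisy-rational(beta=5); reward=+1 for action 1, -1 otherwise",
    "anti-rational": "planner=anti-rational; reward=-1 for action 1, +1 otherwise",
    "degenerate":    "planner=always-act-like-the-policy; reward=0 everywhere",
}

for name, hypothesis in candidates.items():
    print(f"{name:>13}: description length = {description_length(hypothesis)}")

# All three explanations fit the data exactly, and the degenerate one comes out
# shortest here -- so argmin(description length) does not recover the reward we
# actually care about.
```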
I feel a lot of the problem relates to an Extremal Goodhart effect, where the popular imagination views simulations as not equivalent to reality.
However, my guess is that a simplicity prior, not a speed or stability prior, is the default.
That seems right, but aren’t all those heuristics prone to Goodharting? If your prior distribution is extremely sharp and you barely update from it, it seems likely that you’d run into those various failure modes.
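As a minimal numerical sketch of the “extremely sharp prior barely updates” worry (the distribution and the numbers are arbitrary, chosen only for illustration):

```python
# Minimal sketch: a Beta prior that is very confident a coin lands heads stays
# confident even after a run of tails. Numbers are arbitrary.
prior_heads, prior_tails = 10_000.0, 1.0      # sharp prior: ~99.99% heads
observations = [0] * 20                        # 20 tails in a row (0 = tails)

heads, tails = prior_heads, prior_tails
for obs in observations:
    heads += obs
    tails += 1 - obs

print(f"prior mean P(heads)     = {prior_heads / (prior_heads + prior_tails):.4f}")
print(f"posterior mean P(heads) = {heads / (heads + tails):.4f}")
# The posterior mean moves by less than 0.3 percentage points, so any decision
# rule that optimizes hard against this barely-updated belief inherits its
# error -- the Goodhart-style failure mode gestured at above.
```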
Not sure what you mean by “default” here. Likely to be used, effective, or something else?
I actually should focus on the circuit complexity prior, but my view is that, because agents are small compared to reality, they must generalize very well to new environments, which pushes in a simplicity direction.
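A small sketch of that generalization intuition (the data-generating rule, noise level, and candidate complexities below are all arbitrary choices of mine, not anything from the discussion above): with only a handful of samples from a large world, the lower-complexity hypothesis tends to predict held-out situations better.

```python
# Sketch: an agent that sees only 6 samples of a simple underlying rule.
# The low-complexity (degree-1) hypothesis generalizes to a wider range;
# the high-complexity (degree-5) hypothesis fits the samples and extrapolates badly.
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return 0.5 * x                      # "reality": a simple underlying rule

train_x = rng.uniform(-1, 1, size=6)    # the agent only sees 6 samples
train_y = true_fn(train_x) + rng.normal(scale=0.05, size=6)
test_x = np.linspace(-2, 2, 100)        # new environments beyond the training range
test_y = true_fn(test_x)

for degree in (1, 5):                   # low- vs high-complexity hypothesis
    coeffs = np.polyfit(train_x, train_y, deg=degree)
    preds = np.polyval(coeffs, test_x)
    mse = float(np.mean((preds - test_y) ** 2))
    print(f"degree {degree}: held-out MSE = {mse:.3f}")

# The degree-5 fit matches the 6 training points almost exactly but does much
# worse out of range, while the degree-1 fit generalizes -- the usual reason a
# small agent facing a large world gets pushed toward simpler hypotheses.
```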