If you’re saying “let’s think about a more general class of agents because EU maximization is unrealistic”, that’s fair, but note that you’re potentially making the problem more difficult by trying to deal with a larger class with fewer invariants.
If you’re saying “let’s think about a distinct but not more general class of agents because that will be more alignable”, then maybe, and it’d be useful to say what the class is, but: you’re going to have trouble aligning something if you can’t even know that it has some properties that are stable under self-reflection. An EU maximizer is maybe close to being stable under self-reflection and self-modification. That makes it attractive as a theoretical tool: e.g. maybe you can point at a good utility function, and then get a good prediction of what actually happens, relying on reflective stability; or e.g. maybe you can find nearby neighbors to EU maximization that are still reflectively stable and easier to align. It makes sense to try starting from scratch, but IMO this is a key thing that any approach will probably have to deal with.
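As a rough sketch of the object in question, assuming a one-shot setting with actions $a$, world-states $s$, beliefs $P(s \mid a)$, and a utility function $U$, an EU maximizer picks

$$a^* \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{s \sim P(\cdot \mid a)}\big[U(s)\big] \;=\; \arg\max_{a \in \mathcal{A}} \sum_{s} P(s \mid a)\, U(s).$$

The usual reflective-stability intuition is that a candidate self-modification is itself just another action, evaluated under the current $U$; modifications that would replace $U$ with some other $U'$ generically score poorly by the lights of the current $U$, so the agent tends to preserve its utility function.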
I strongly suspect that expected utility maximisers are anti-natural for selection for general capabilities.
There’s naturality as in “what does it look like, the very first thing that is just barely generally capable enough to register as a general intelligence?”, and there’s naturality as in “what does it look like, a highly capable thing that has read-write access to itself?”. Both interesting and relevant, but the latter question is in some ways an easier question to answer, and in some ways easier to answer alignment questions about. This is analogous to unbounded analysis: https://arbital.com/p/unbounded_analysis/
In other words, we can’t even align an EU maximizer, and EU maximizers have to some extent already simplified away much of the problem (e.g. the problems coming from more unconstrained self-modification).
You seem to be trying to rescue EU maximisation as the model on the grounds that it is, in some sense, a limit of agency. I don’t think this is the case.
In classical and quantum derivations of the Free Energy Principle, it is shown that the limit is perfect prediction of the agent’s environment (or, more pedantically: in the classical formulation, the FEP is derived from basic statistical mechanics; in the quantum formulation, it is more of a postulate, but the quantum FEP is shown to be equivalent, in the limit, to the Unitarity Principle). Also, Active Inference, the process theory derived from the FEP, can be seen as a formalisation of instrumental convergence.
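As a rough sketch of the quantity involved, in the standard formulation with hidden states $s$, observations $o$, a generative model $p(o, s)$, and an approximate posterior $q(s)$, the variational free energy is

$$F[q] \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] \;=\; D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] \;-\; \ln p(o).$$

Minimising $F$ drives the KL term to zero, so in the limit the agent’s beliefs $q(s)$ coincide with the exact posterior (“perfect prediction” in the sense above) and $F$ bottoms out at the surprisal $-\ln p(o)$.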
So, we can informally outline the “stages of life” of a self-modifying agent as follows: general intelligence → maximal instrumental convergence → maximal prediction of the environment → maximal entanglement with the environment.
What you’ve said so far doesn’t seem to address my comments, or make it clear to me what the relevance of the FEP is. I also don’t understand the FEP or the point of the FEP. I’m not saying EU maximizers are reflectively stable or a limit of agency; I’m saying that EU maximization is the least obviously reflectively unstable thing I’m aware of.
I said that a limit of agency has already been proposed, from the physical perspective (the FEP). And this limit is not EU maximisation. So, methodologically, you should either criticise this proposal, or suggest an alternative theory that is better, or take the proposal seriously.
If you take the proposal seriously (I do): the limit appears to be “uninteresting”. A maximally entangled system is “nothing”; it is perceptually indistinguishable from its environment for a third-person observer (say, in Tegmark’s tripartite partition of system, environment, and observer). There is no other limit. Instrumental convergence is not the limit; a strongly instrumentally convergent system is still far from the limit.
This suggests that unbounded analysis, “thinking to the limit”, is not useful in this particular situation.
Any physical theory of agency must ensure “reflective stability” by construction. I definitely don’t sense anything “reflectively unstable” in Active Inference, because it is basically the theory of self-evidencing, and it wields instrumental convergence in service of this self-evidencing. Who wouldn’t “want” this, reflectively? Active Inference agents in some sense must want this by construction, because they want to be themselves for as long as possible. However they redefine themselves, at that very moment they also want to be themselves (as redefined). The only logical way out of this is to not want to exist at all at some point, i.e., to commit suicide, which agents (e.g., humans) actually do sometimes. But conditioned on wanting to continue to exist, they are definitely reflectively stable.
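As a rough sketch of how this cashes out in the standard discrete-state formulation of Active Inference: policies $\pi$ are scored by an expected free energy

$$G(\pi) \;=\; -\underbrace{\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o)\big]}_{\text{pragmatic value}} \;-\; \underbrace{\mathbb{E}_{q(o \mid \pi)}\Big[D_{\mathrm{KL}}\big[q(s \mid o, \pi)\,\|\,q(s \mid \pi)\big]\Big]}_{\text{epistemic value}},$$

where the preference prior $p(o)$ encodes the observations characteristic of the agent being itself. Minimising $G$ pursues self-evidencing through the pragmatic term and information gain through the epistemic term, which is the sense in which instrumentally convergent behaviour is wielded in service of self-evidencing.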
I’m talking about reflective stability. Are you saying that all agents will eventually self-modify toward the FEP limit, and that the FEP limit is a rock?
Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning