Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.
I do not believe this alignment strategy requires a control feedback loop at all. And I believe that retaining control over an AI as it rapidly improves its capabilities is perhaps a quixotic goal.
So no, I am not pointing at the distinction between ‘implicit/aligned control’ and ‘delegated control’ as terms used in the paper. From the paper:
Delegated control: agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.
Well, in the example given above, the agent doesn’t decide for itself what the subject’s desire is: it simply optimizes for its own desire. The work of deciding what is ‘long-term-best for the subject’ does not happen unless that is actually what the goal specifies.

[1] For certain definitions of “simply”.
if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.
This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.

https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p9
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider the two equivalent) is not a sound assumption? Because in that case I believe we are mostly doomed. I don’t expect the ‘control problem’ to be solvable, nor do I think it makes sense for humanity to keep a leash on something superintelligent whose preferences can shift.
Sure, I appreciate the open question!

That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.
Any empirically sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (i.e. define goals with respect to context-invariant perception of regular aspects of the environment, e.g. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.
This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Which the machinery will need to do to be self-sufficient, i.e. to adapt to the environment and to survive as an agent.
Natural abstractions are also leaky abstractions. Meaning that even if an AGI could internally define a goal robustly with respect to natural abstractions, the AGI cannot conceptually contain, within its modelling of natural abstractions, more than a tiny portion of the (side-)effects propagating through the environment as a result of all the interactions of the machinery’s functional components with its connected physical surroundings.
Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, depending on which ones continue to exist and replicate.
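To make the kind of feedback I have in mind more concrete, here is a deliberately simplified toy sketch (the quantities, the drift mechanism, and the learning rule are all made up for illustration, not taken from any real system): an agent keeps re-learning part of its world model from sensor inputs, while unmodelled side-effects of its operation slowly shift both the environment and its own sensors, so the learned parameters drift even though the goal specification itself never changes.

```python
import random

# Toy sketch only: an agent re-estimates one parameter of its world model from
# noisy sensor readings, while unmodelled side-effects of its own operation
# slowly shift the environment and bias its sensors (a stand-in for hardware
# parts wearing out and being replaced). The goal specification never changes,
# yet what the agent has learned drifts away from what it was originally tuned to.

random.seed(0)

true_state = 1.0      # the aspect of the environment the goal refers to
sensor_bias = 0.0     # accumulates as the agent's own substrate changes
learned_model = 1.0   # the agent's learned estimate of true_state

for step in range(1000):
    true_state += 0.001 * random.gauss(0, 1)   # unmodelled side-effect on the environment
    sensor_bias += 0.0005                       # unmodelled change in the agent's substrate
    reading = true_state + sensor_bias + random.gauss(0, 0.01)
    learned_model += 0.05 * (reading - learned_model)  # continued self-learning from inputs

print("original reference point: 1.0")
print(f"environment now:          {true_state:.3f}")
print(f"agent's learned model:    {learned_model:.3f}")
```

In this toy, the drift comes entirely from effects that the agent’s model never represents, which is the point being argued here rather than a property of any particular learning rule.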
We need to define the problem comprehensively enough. The scope of the question “Is there a way to define a goal in a way that is robust to ontological shifts?” is not sufficient to address the overarching question “Can AGI be controlled to stay safe?”.
To state the problem comprehensively enough, you need to include the global feedback dynamics that would necessarily happen through any AGI (as ‘self-sufficient learning machinery’) over time.
~ ~ ~

Here is also a relevant passage from the link I shared above:
- that saying/claiming that *some* aspects of some things, at some levels of abstraction, are sometimes generally predictable is not to say that _all_ aspects are _always_ completely predictable, at all levels of abstraction.
- that localized details that are filtered out from content or irreversibly distorted in the transmission of that content over distances nevertheless can cause large-magnitude impacts over significantly larger spatial scopes.
- that so-called ‘natural abstractions’ represented within the mind of a distant observer cannot be used to accurately and comprehensively simulate the long-term consequences of chaotic interactions between tiny-scope, tiny-magnitude (below measurement threshold) changes in local conditions.
- that abstractions cannot capture phenomena that are highly sensitive to such tiny changes except as post-hoc categorizations/analysis of the witnessed final conditions.
- where, given actual microstate amplification phenomena associated with all manner of non-linear phenomena, particularly those commonly observed in all sorts of complex systems, up to and especially including organic biological humans, it *can* be legitimately claimed, based on there being a kind of hard randomness associated with the atomic physics underlying all of the organic chemistry, that in fact (more than in principle) humans (and AGI) are inherently unpredictable, in at least some aspect, *all* of the time.
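To illustrate the kind of microstate amplification described in that passage, here is a minimal sketch using the logistic map as a stand-in for a non-linear system (the map is my own choice of example, not something from the linked text): two trajectories that start closer together than any realistic measurement could distinguish end up completely decorrelated within a few dozen steps.

```python
# Minimal sketch: sensitivity to a difference in initial conditions far below
# any realistic measurement threshold, using the chaotic logistic map
# x -> 4x(1 - x) as a stand-in for a non-linear system.

def trajectory(x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

a = trajectory(0.300000000, 60)
b = trajectory(0.300000001, 60)   # initial difference of one part in a billion

for step in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {step:2d}: |difference| = {abs(a[step] - b[step]):.9f}")
```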
No, the way I used the term was to point to robust abstractions of ontological concepts. Here’s an example: say 1+1=A. A here obviously means 2 in our language, but that doesn’t change what A represents, ontologically. If A+1=4, then you have broken math, and that results in you being less capable in your reasoning and being “Dutch booked”. Your world model is then incorrect, and it is very unlikely that any ontological shift will result in such a break in world-model capabilities.
Math is a robust abstraction. “Natural abstractions”, as I use the term, point to abstractions for objects in the real world that share the same level of robustness to ontological shifts, such that as an AI gets better and better at modelling the world, its ontology tends more and more towards representing the objects in question with these abstractions.
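A toy way to picture what I mean (this is just my own illustration, with made-up names, not a formal argument): a consistent relabelling of the underlying quantities preserves every arithmetic relation, while an inconsistent one assigns two different values to the same quantity and can therefore be Dutch-booked.

```python
# Toy sketch: "A" is just a new label for 2, so a consistent relabelling
# preserves every arithmetic relation. An ontology that insists A + 1 = 4
# assigns two different values to the same quantity and can be exploited.

labels = {0: "zero", 1: "one", 2: "A", 3: "three", 4: "four"}
values = {name: n for n, name in labels.items()}   # inverse mapping

def add(x_name, y_name):
    """Addition expressed entirely in the relabelled ontology."""
    return labels[values[x_name] + values[y_name]]

assert add("one", "one") == "A"      # 1 + 1 = A, i.e. 2 under a new name
assert add("A", "one") == "three"    # the relation A + 1 = 3 is preserved

# The broken ontology claims A + 1 = 4 while the structure says it is 3:
# anyone trading on both valuations of the same quantity takes a guaranteed loss.
broken_claim = values["four"]
actual_value = values["A"] + 1
print("exploitable gap per trade:", broken_claim - actual_value)
```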
Meaning that even if an AGI could internally define a goal robustly with respect to natural abstractions, the AGI cannot conceptually contain, within its modelling of natural abstractions, more than a tiny portion of the (side-)effects propagating through the environment as a result of all the interactions of the machinery’s functional components with its connected physical surroundings.
That seems like a claim about the capabilities of arbitrarily powerful AI systems, one that relies on chaos theory or complex systems theory. I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
Thanks for the clear elaboration.

I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.
However, those natural abstractions are still leaky, in a sense similar to how Platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely to the actual physical shape of e.g. a wheel identified to exist in the outside world.
In this sense, whatever natural abstractions the AGI would use to allow the learning machinery to compress observations of actual physical instantiations of matter or energetic interactions into its modelling of the outside world, those natural abstractions would still fail to capture all the long-term-relevant features of the outside world.
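As a toy illustration of that leakiness (the wheel shape and all numbers below are invented for the example): describe a slightly worn, slightly out-of-round wheel with the natural abstraction “circle of radius r”, and the abstraction compresses the shape well while discarding exactly the small deviations that may matter later (the flat spot, the wobble).

```python
import math

# Toy sketch: sample the rim of a slightly out-of-round, slightly worn "wheel",
# then describe it with the natural abstraction "circle of radius r".
# The abstraction compresses the shape well, but the residuals are precisely
# the details it throws away.

def wheel_radius(theta):
    r = 0.30                                # nominal radius in metres
    r += 0.002 * math.sin(2 * theta)        # slight out-of-roundness
    if abs(theta - math.pi / 3) < 0.2:      # small flat spot from wear
        r -= 0.003
    return r

angles = [2 * math.pi * i / 360 for i in range(360)]
radii = [wheel_radius(t) for t in angles]

abstracted_radius = sum(radii) / len(radii)          # the "circle" abstraction
max_residual = max(abs(r - abstracted_radius) for r in radii)

print(f"abstraction: circle with r = {abstracted_radius:.4f} m")
print(f"largest deviation the abstraction discards: {max_residual:.4f} m")
```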
This point I’m sure is obvious to you. But it bears repeating.
That seems like a claim about the capabilities of arbitrarily powerful AI systems,
Yes, or more specifically: about the fundamental limits of any AI system to control how its (side-)effects propagate and feed back over time.
one that relies on chaos theory or complex systems theory.
Pretty much. Where “complex” refers to both internal algorithmic complexity (NP-computation branches, etc.) and physical functional complexity (distributed non-linear amplifying feedback, etc.).
I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your “doubt that...alignment will be difficult”.
I doubt that you have thought this through comprehensively enough, or that your reasoning addresses the fundamental limits to controllability I summarised in this post.
The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away.
You’d need to clarify specifically why functional components iteratively learned/assembled within AGI could have long-term predictable effects in physical interactions with shifting connected surroundings of a more physically complex outside world.
I don’t mind whether that’s framed as “AGI redesigns a successor version of their physically instantiated components” or “AGI keeps persisting in some modified form”.
This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
I’m pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for “perfect hardware copies” misses the point, in my opinion: it seems like you want me to accept that just because there isn’t a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain continued control over the AI. In my model, humanity is already less capable than the predecessor AI, so trying to retain control would reliably lead to worse outcomes.
What is your reasoning?

I stated it in the comment you replied to:

Actually, that is switching to reasoning about something else.
Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why an AGI would stay aligned in its effects on the world, such that it stays safe to humans.
And with that switch, you are not addressing Nate Soares’ point that “capabilities generalize better than alignment”.
Nate Soares’ point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can’t quite tell what your belief is.
Thanks, reading the post again, I do see quite a lot of emphasis on ontological shifts:
“Then, the system takes that sharp left turn, and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart.”
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
How do you know that the degree of error correction possible will be sufficient to have any sound and valid guarantee of long-term AI safety?
Again, people really cannot rely on your personal expectation when it comes to machinery that could lead to the deaths of everyone. I’m looking for specific, well-thought-through arguments.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely?
Yes, that is the conclusion I reached after probing my mentor’s argumentation for 1.5 years and concluding that the empirical premises are sound and the reasoning logically consistent.
Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?
I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371
Both still require control feedback processes built into the AGI system/infrastructure.