Returning to the error correction point:
Feel free to still clarify the other reasons why the changes in learning would be stable in preserving “good properties”. Then I will take that starting point to try to explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability.
Fundamentally though, we’ll still be discussing the application limits of error correction methods.
Three ways to explain why:
Any workable AI-alignment method involves receiving input signals, comparing input signals against internal references, and outputting corrective signals to maintain alignment of outside states against those references (ie. error correction).
Any workable AI-alignment method involves a control feedback loop – of detecting the actual (or simulating the potential) effects internally and then correcting actual (or preventing the potential) effects externally (ie. error correction).
Eg. mechanistic interpretability is essentially about “detecting the actual (or simulating the potential) effects internally” of AI.
The only way to actually (slightly) counteract AGI convergence on causing “instrumental” and “needed” effects within a more complex environment is to simulate/detect and then prevent/correct those environmental effects (ie. error correction).
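To make explicit the shared structure all three descriptions point at, here is a minimal sketch of such a control feedback loop (Python; the function names and the scalar “state” are placeholders I made up for illustration, not any particular alignment method):

```python
# Toy sketch of the error-correction loop any alignment method reduces to:
# detect effects, compare them against an internal reference, output a correction.
# All names and the scalar "state" are illustrative placeholders, not a real API.

def detect(environment):
    """Measure (or simulate) the effects the AI is having externally."""
    return environment["state"]

def compare(observed, reference):
    """Compute the error between observed effects and the internal reference."""
    return reference - observed

def correct(environment, error, gain=0.5):
    """Output a corrective signal nudging external state back toward the reference."""
    environment["state"] += gain * error

def alignment_loop(environment, reference, steps=100):
    for _ in range(steps):
        error = compare(detect(environment), reference)
        correct(environment, error)
    return environment

# Example: external state has drifted to 5.0; the loop corrects it back toward 0.0.
print(alignment_loop({"state": 5.0}, reference=0.0))
```

Different alignment proposals fill in the detect/compare/correct steps differently, but this loop structure is what the rest of the argument is about.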
~ ~ ~
Which brings us back to why error correction methods, of any kind and in any combination, cannot ensure long-term AGI Safety.
I reread your original post and Christiano’s comment to understand your reasoning better and to see how I could explain the limits of applicability of error correction methods.
I also messaged Forrest (the polymath) to ask for his input.
The messages were of a high enough quality that I won’t bother rewriting the text. Let me copy-paste the raw exchange below (with a few spelling edits).
Remmelt 15:37
@Forrest, would value your thoughts on the way Carl Shulman is thinking about error correcting code, perhaps to pass them along on the LessWrong Forum:
(https://www.lesswrong.com/posts/uFNgRumrDTpBfQGrs/let-s-think-about-slowing-down-ai?commentId=bY87i5v5StH9FWdWy).
Remmelt 15:38
Remmelt:
“As another example [of unsound monolithic reasoning], your idea of Von Neumann Probes with error correcting codes, referred to by Christiano here (https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=Jaf9b9YAARYdrK3jp), cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here.”
Carl Shulman:
”This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves), but don’t cover all changes that could derive from learning (although there are other reasons why those could be stable in preserving good or terrible properties).”
Remmelt 15:40
Excerpting from the comment by Christiano I link to above:
”The production-web has no interest in ensuring that its members value production above other ends, only in ensuring that they produce (which today happens for instrumental reasons). If consequentialists within the system intrinsically value production it’s either because of single-single alignment failures (i.e. someone who valued production instrumentally delegated to a system that values it intrinsically) or because of new distributed consequentialism distinct from either the production web itself or any of the actors in it, but you don’t describe what those distributed consequentialists are like or how they come about.
You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted. But this it seems like the actual efficiency loss required to preserve human values seems very small even over cosmological time (e.g. see Carl on exactly this question: http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html).
And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights, or war between blocs with higher and lower savings rates (some of them too low to support human life, which even if you don’t buy Carl’s argument is really still quite low, conferring a tiny advantage). If those are the chief mechanisms then it seems important to think/talk about the kinds of agreements and treaties that humans (or aligned machines acting on their behalf!) would be trying to arrange in order to avoid those wars.”
Remmelt 15:41
And Carl Shulman’s original post on long-term error-correcting Von Neumann Probes:
(http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html):
”But the program of an AI, large stores of astronomical observations for navigation, and vast stores of technological information would take up an enormous amount of memory and storage space, perhaps many exabytes or more. Given this large body of information, adding additional directives to ensure that the probes eventually turn to producing welfare need only increase storage needs by a very small proportion, e.g. by 1 in 1 billion. Directives could directly specify the criteria to be eventually optimized, or could simply require compliance with further orders traveling behind the frontier of colonization.
...
Mutation is easier to resist for computers than animals
Biological life on Earth has evolved through mutation, and the reproductive process introduces significant errors in each generation. However, digital information storage allows for the comparison of redundant copies and the use of error-correcting codes, making substantive mutation many orders of magnitude less likely than in Earthly life.”
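(To make the mechanism in this quote concrete: below is a toy sketch of repairing random copying errors by majority vote over redundant copies. Real systems would use proper error-correcting codes such as Reed–Solomon; the sketch is only meant to show what the mechanism does, and does not, cover.)

```python
# Toy illustration of redundancy-based error correction: keep several copies of the
# fixed program and repair random copying corruption by per-byte majority vote.
import random

def corrupt(blob: bytes, n_flips: int) -> bytes:
    data = bytearray(blob)
    for _ in range(n_flips):
        i = random.randrange(len(data))
        data[i] ^= 1 << random.randrange(8)   # flip one random bit
    return bytes(data)

def majority_repair(copies: list[bytes]) -> bytes:
    # For each byte position, keep the value that most copies agree on.
    return bytes(max(set(col), key=col.count) for col in zip(*copies))

original = b"directive Q: after condition Z, divert resources to producing welfare"
copies = [corrupt(original, n_flips=3) for _ in range(7)]
assert majority_repair(copies) == original   # random copying errors get repaired

# What this does NOT cover: if the probe *learns* new code in interaction with its
# environment, there is no original reference copy to vote against, so changes
# introduced through learning are outside the reach of this mechanism.
```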
Remmelt 15:45
Returning to the new comment by Carl Shulman:
”This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves)”
Remmelt 15:46
Thinking about this overnight, I think Carl’s stated reasoning is still unsound for multiple reasons:
1. Contradiction between a Von Neumann Probe being adaptable enough (ie. learning new ways of processing inputs into outputs) to travel across space and seed new civilisations, yet having error correcting code that allows comparison of new code with original redundant copies. Not going to work, for reasons Forrest amply explained and I tried to summarise here: https://docs.google.com/document/d/1-AAhqvgFNx_MlLkcSgw-chvmFoC4EZ4LmTl1IWcsqEA/edit
Ooh, and in Forrest’s AGI Error Correction post: https://mflb.com/ai_alignment_1/agi_error_correction_psr.html#p1
Think I’ll share that.
Remmelt 15:54
2. Confuses complicated pre-loaded technological knowledge/systems with complex adaptive systems. The fact that they are saying that adding in directives would only increase storage by 1 part in 1 billion is a giveaway, I think.
Remmelt 15:55
3. Inverse take on 1.
Algorithms which can flexibly ‘mutate’ and branch out into different versions become better at using resources and multiplying than more rigid or robustly functional designs. This makes Carl Shulman’s case for launching self-replicating space probes with code error-checking/minimisation routines seem a lot more dicey. If a defecting group launches even one alternate design with a flexible code-mutating ability that confers an advantage that can’t easily be copied by the error-minimising designs without compromising on their ability to act on the directives humans originally coded in to ‘directly specify the criteria to be eventually optimized’ – well, then you might end up instead with swarms of space probes that eat up the galaxy indiscriminately, including any remaining carbon-based lifeforms on planet Earth.
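(A toy illustration of this dynamic – the numbers are made up and nothing here models real probes – where a lineage that is free to mutate its own replication efficiency eventually dwarfs a lineage whose code is locked in place by error correction:)

```python
# Toy replicator sketch: a design locked to its original growth factor vs. a design
# whose variants can slightly improve their own efficiency. Parameters are arbitrary.
import random

random.seed(1)
fixed_rate = 1.10                # error-corrected probes: growth factor never changes
mutable_rates = [1.10]           # growth factors of the mutable design's variants

fixed_pop, mutable_pop = 1.0, 1.0
for generation in range(200):
    fixed_pop *= fixed_rate
    if random.random() < 0.2:    # occasionally a variant appears with better efficiency
        mutable_rates.append(max(mutable_rates) * 1.01)
    mutable_pop *= max(mutable_rates)   # the best variant dominates its lineage

print(f"fixed-design population:   {fixed_pop:.3e}")
print(f"mutable-design population: {mutable_pop:.3e}")   # many orders of magnitude larger
```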
Underlying premise: even if humans construct a long-term aligned AI design – one where humans can formally prove that the model causally constrains any possible process of agency emerging from, and expanding across, each of the physical parts in which the model infers its computational process to be embedded, such that it stays within all the fundamental bounds necessary for maintaining alignment with the values that humans broadly share in common – then in practice that design is still ‘one step away’ from being mutated into misalignment by a faction of humans who seek a capability advantage (does it give one, though?) for manifesting their more granular personal values.
Remmelt 15:57
@Forrest anything to add to the above? How about I write that into a comment reply (which perhaps might make for a stand-alone post later?)
Forrest 16:25
The starting point is: That any procedure of AGI alignment will (ie, cannot not) resemble some sort of error correction algorithm.
Forrest 16:28
One cannot disagree with this point and still have the notion of alignment be persistent. If they argue here, they will need to establish a conformable coherent counter example. Simple contradiction is not enough, as the claim basis shifts to them.
Forrest 16:30
Once hard line equivalence is firmly established, then the argument re limits of causation in all error correction processes applies.
Remmelt 16:44
Good one. So start with how any AGI-alignment method must resemble an error correction algorithm in order to correct out misaligned behaviour learned and expressed in interaction with the environment.
Forrest 16:48
In regards to the idea that some machine could “have a strategy of optimizing for near-maximum X until condition Z is achieved, and *then* implement directive Q”: this makes at least three assumptions that are highly unlikely.
1; there is no assurance that, over the long duration of the optimization of X, there would be persistence of directive Q.
2; there is no assurance that the condition ‘Z’ is consistently detectable, or that the triggering condition is even likely to be implemented, even if detected (as per a violation of assumption 1).
3; the integration coupling of directive Q is mismatched to the sheer mass of the output of the optimized code—ie, it is unlikely that ‘Q’, no matter how formed, will actually result in the consistent behavior described. Ie, ‘Q’ is significantly underspecified, and if ‘Q’ was appropriately specified, it would not be “lightweight” with respect to the optimization process ‘X’, and would have been optimized away to ineffectiveness. Hence, even if Z condition trigger happens, Q does not happen.
Note insofar as the logic is directly generalized over the logical claim form of “adopt a strategy of near-maximum colonization speed until they reach the most distant accessible galaxies, and only then divert resources to producing welfare”, the proposal itself is not well specified. Ie, the assumption that ‘Q’ and detection of ‘Z’ can be “simple” is a conceptual category error.
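(A toy illustration of assumption 1, with made-up parameters: selection acts only on colonization speed ‘X’, while a dormant directive ‘Q’ contributes nothing to X; the 1%-per-copy corruption rate stands in for whatever changes slip past error correction through learning and environmental interaction. Under those assumptions, almost no lineage still carries Q intact after a long optimization:)

```python
# Toy sketch: a dormant directive that is invisible to selection erodes over a long
# optimization of X, even though it costs almost nothing to store.
import random

random.seed(0)
POP = 200
population = [{"speed": 1.0, "q_intact": True} for _ in range(POP)]

for generation in range(500):
    offspring = []
    for parent in population:
        for _ in range(2):                              # each probe replicates twice
            child = dict(parent)
            child["speed"] *= 1 + random.gauss(0, 0.02) # copying slightly alters X
            if random.random() < 0.01:                  # rare uncorrected change hits Q
                child["q_intact"] = False
            offspring.append(child)
    # selection acts on speed only; whether Q is intact is invisible to selection
    population = sorted(offspring, key=lambda c: c["speed"], reverse=True)[:POP]

intact = sum(c["q_intact"] for c in population)
print(f"probes with directive Q still intact after 500 generations: {intact}/{POP}")
# Expected fraction intact is roughly 0.99**500, i.e. under 1% – a handful out of 200.
```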
Remmelt 17:02
I read through this, and think I understand your points.
Forrest 17:10
Yes. I am trying to think of a good metaphor. It is maybe a bit like some single ordinary person trying to command everyone to do something in some consistent way, when that _everyone_ is 100000 people all yelling at one another in some huge auditorium somewhere. The assumption that one person, even at maximum personal signal volume and intensity, could be heard over all similar other people (in this case, the other objectives of the Von Neumann probe), and command them to implement some new action pattern reliably consistent with ‘Q’, is simply deeply mismatched to the output reality of the optimization process ‘X’. The single directive ‘Q’ is simply not powerful enough to fully conditionalize all of ‘X’, even under triggering conditions ‘Z’.
Also, I notice that the assumption that the error correction process could be applied linearly to a dynamic self-recursive system at arbitrary levels of abstraction is simply another category error. Wrong tool for the wrong job. That is less important, though, than the issue with the Q directive’s efficiency/effectivity mismatch.
Forrest 17:37
Also, I added the following document to assist in some of what you are trying to do above: https://mflb.com/ai_alignment_1/tech_align_error_correct_fail_psr.html#p1
This echoes something I think I sent previously, but I could not find it in another doc, so I added it.