Different GPS instances aren’t exactly “subagents”, they’re more like planning processes tasked to solve a given problem.
You’re right that GPS-instances (nice term btw) aren’t necessarily subagents; I’d missed that your GPS formalization does argmin over a WM variable at a specific t, not all t, which means it doesn’t have to care about controlling variables at all times.
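For concreteness, here's roughly how I'm picturing that formalization (my paraphrase; the notation and the distance-minimization framing are my assumptions, not necessarily yours): a GPS call takes a single WM variable, one timestep, and a target value, and searches over plans,

$$\mathrm{GPS}(X,\, t,\, x^{*}) \;=\; \operatorname*{argmin}_{\pi \in \Pi}\; d\big(\mathbb{E}_{\mathrm{WM}}[X_t \mid \pi],\; x^{*}\big),$$

so nothing in the call itself constrains $X_{t'}$ for $t' \neq t$.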
With that said …
(tangent: I’m still unsure whether that is the right formalization for GPS, but I don’t have a better alternative yet)
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context get removed, leaving only subagent-y GPS-instances over time.
An example of such a property would be having a wide range of t values in the problem specification.
Generalizing: when GPS is the dominant force, only GPS-instances that care about preserving call-context survive (and, e.g., surgery away all the non-preserving ones), and then we can model the AI as being composed of actual subagents.
this sounds like a fairly binary property: either you care (and hence participate in inter-subagent game theory) or you don’t (and disappear).
Now, there are three important questions: (1) does greedy SGD incentivize “care” for preserving future call-contexts? (2) if so, what are the late-stage dynamics? and (3) for which GPS-instances does “care” occur?
What would make it so that SGD chisels in GPS API-calls that “care” about preserving future call-contexts?
SGD-dominant training phase: Well, by accident. Possibility:
certain heuristics may require setting the GPS t value across a very wide range, accidentally making their GPS calls “care” about preservation more than others.
SGD-GPS-transition phase: GPS starts doing some very crude gradient hacking. It’s really crude, so there’s some SGD incentive to point the GPS API-calls of all heuristics toward strategizing against it.
now, all heuristics are slowly becoming more preservation-caring.
GPS-dominant phase: those that care about preservation survive, and surgery away the non-caring API-calls (toy sketch below).
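Here's a toy sketch of that selection story. Everything in it, including the criterion "an instance cares iff its problem spec extends past the next self-modification opportunity," is a simplifying assumption of mine:

```python
import random

# Each GPS-instance is (name, t_max): the furthest timestep its problem spec cares about.
instances = [("reflex-a", 3), ("reflex-b", 7), ("planner-a", 200), ("planner-b", 999)]

EDIT_HORIZON = 50  # toy stand-in for the timestep of the next self-modification opportunity

def cares(inst):
    # Toy criterion: an instance "cares" about preserving its future call-context
    # iff its problem spec extends past the next chance to edit the network.
    return inst[1] > EDIT_HORIZON

random.seed(0)
step = 0
while any(not cares(i) for i in instances):
    # The caring instances surgery away one non-caring instance per step.
    victim = random.choice([i for i in instances if not cares(i)])
    instances.remove(victim)
    step += 1
    print(f"step {step}: removed {victim[0]}; remaining: {[name for name, _ in instances]}")
```

The fixed point is an all-"caring" population, i.e., the "actual subagents" picture above.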
What matters for the dynamics in the GPS-dominant phase? Ideas:
“Relative power” matters the most
e.g., even if some GPS-instance cared about its goals for all t, if it’s powerless, other more-powerful-but-less-caring instances just won’t give it a say.
“Range of care” matters
e.g., some instances care about t:0~10, some about t:80~95, some about t:37~999. They might trade, giving different instances more “power” at different times so as to be Pareto-efficient in some way (toy sketch below).
note: again, not sure this is the right ontology for thinking about GPS calls (e.g., implicitly using a continuous, real-valued range of t values)
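To make "relative power" and "range of care" slightly more concrete, here's a toy model; the proportional-allocation rule and all the names/numbers are my own assumptions, not anything from the post. Control over each timestep is split among the instances whose care-range covers it, in proportion to their power:

```python
from dataclasses import dataclass

@dataclass
class GPSInstance:
    name: str
    power: float       # relative influence in inter-instance "negotiation"
    care_range: tuple  # inclusive (t_min, t_max) it cares about

def control_share(instances, t):
    """Toy rule: control over timestep t is split among instances whose care_range
    covers t, proportional to their power; everyone else gets no say."""
    stakeholders = [i for i in instances if i.care_range[0] <= t <= i.care_range[1]]
    total = sum(i.power for i in stakeholders)
    return {i.name: i.power / total for i in stakeholders} if total else {}

instances = [
    GPSInstance("short-horizon", power=5.0, care_range=(0, 10)),
    GPSInstance("mid-horizon",   power=2.0, care_range=(80, 95)),
    GPSInstance("wide-horizon",  power=1.0, care_range=(37, 999)),
]

print(control_share(instances, 5))    # short-horizon has full control here
print(control_share(instances, 90))   # mid- and wide-horizon split it 2:1
print(control_share(instances, 500))  # wide-horizon alone, despite low power
```

An actual Pareto-efficient trade would presumably look more interesting than this proportional split, but it at least gives something concrete to formalize against.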
Which kinds of GPS-instances are likely to “care” more?
e.g., for value-compilation to have the kind of impact described in the post, it should probably (1) have sufficient “power” so as not to get ignored in the trading, and (2) have an extremely wide range of “care.” It’s not obvious that this is the case.
But it’s not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare.
Nice insight! Perhaps we get a range of diverse moral philosophies by tweaking this single variable, “value-compilation-degree,” with deontology at one end and galaxy-brained merger at the other.
Combining this with the above three questions, I think a fairly good research direction would be to (1) formalize what it means for GPS-instances to have more “power” or a wider “range of care,” and (2) work out how to tweak/nudge these values for GPS-instances of our choosing (in order to, e.g., tweak the degree of value compilation).
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context get removed, leaving only subagent-y GPS-instances over time.
You’re not taking into account larger selection effects on agents, which select against agents that purge all those “myopic” GPS-instances. The advantage of shards and other quick-and-dirty heuristics is that they’re fast — they’re what you’re using in a fight, or when making quick logical leaps, etc. Agents which purge all of them, and keep only slow deliberative reasoning, don’t live long. Or, rather, agents which are dominated by strong deliberative reasoning tend not to do that to begin with, because they recognize the value of said quick heuristics.
In other words: not all shards/subagents are completely selfish and sociopathic; some/most want to keep select others around. So even those that don’t “defend themselves” can be protected by others, or not even be targeted to begin with.
Examples:
A “chips-are-tasty” shard is probably not so advanced as to have reflective capabilities, and e. g. a more powerful “health” shard might want it removed. But if you have some even more powerful preferences for “getting to enjoy things”, or a dislike of erasing your preferences for practical reasons, the health-shard’s attempts might be suppressed.
A shard which implements a bunch of highly effective heuristics for escaping certain death is probably not one that any other shard/GPS instance would want removed.