Okay, more questions incoming: “Why would GPS be okay with value-compilation, when its expected outcome is to not satisfy in-distribution context behaviors through big-brain moves?”
If I understood correctly (can be skipped; not relevant to my argument, which starts after the bullet points):
Early in training, GPS is part of the heuristic/shard implementation (accessed via API calls)
Mid-training, there is some SGD-incentive towards pointing GPS in the direction of “reverse-engineering heuristic/shard firing patterns, representing them (Gs) as WM variables, and pursuing them directly”
Later in training, there is some SGD-incentive towards pointing GPS to do “value-compilation” by merging/representing more abstract (i.e., “compiled”) versions of the previous goal pointers and directly pursuing them.
In other words, there are some GPS API calls that say “do value compilation over GPS targets”, and some (previously formed) GPS API calls that say “pursue this reverse-engineered-heuristic-goal”
Very late (approx. SGD influence ~ 0 due to starvation + hacking, at which point we can abstract SGD away as an extension of GPS), everything is locked in
Tangent: we can effectively model the AI as subagents, with each subagent being a GPS API-call. It wouldn’t be exactly right to call them “shards,” and it wouldn’t be right to call them “GPS” either (because GPS is presumably a single module; maybe not?). Some slick new terminology might be needed … I’ll just call them “Subagents” for now.
To rephrase my question, why would the other (more early-formed) GPS API calls be okay with the API calls for value-compilation?
As you mentioned in a different comment thread, there is no reason for the GPS to obey in-distribution behavior (inner misalignment). So, from the perspective of a GPS that’s API-called with pursuing a reverse-engineered-heuristic-goal, it would think:
“hm, I notice that in the WM & past firing patterns, several goal-variables are being merged/abstractized and end up not obeying the expected in-distribution context behaviors, presumably due to some different GPS-API-call. Well, I don’t want that happening to my values—I need to somehow gradient hack in order to prevent that from happening!”
I think this depends on the distribution of optimization “power” between different GPS API-calls (tangent: how is it possible for them to have different power when the GPS, presumably, is a single shared module and the only difference is in how it’s called? Whatever). Only if the API call for value compilation overwhelms the opposing pressure from the rest of the API calls (all of which have an incentive to resist value compilation, and would probably collude) can value compilation actually proceed—which seems pretty unlikely.
Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).
Alternate phrasing: The “Subagents” (check earlier bullet point for terminology) will have an incentive themselves to solve inner misalignment. Therefore, “Subagents” like “Value-compilation-GPS-API-call” that may do dangerous “big-brain” moves are naturally the enemy of everyone else, and will be gradient-hacked away from existence.
Is there any particular reason to believe the GPS API-calls-for-value-compilation would be so strongly chiseled in by the SGD (when SGD still has influence) as to overwhelm all the other API-calls (when SGD stops mattering)?
For reference, I think you’ve formed a pretty accurate model of my model.
Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).
Yup. But this requires these GPS instances to be advanced enough to do gradient-hacking, and indeed be concerned with preventing their current values from being updated away. Two reasons not to expect that:
Different GPS instances aren’t exactly “subagents”, they’re more like planning processes tasked to solve a given problem.
Consider an impulse to escape from certain death. It’s an “instinctive” GPS instance; the GPS has been prompted with “escape certain death”, and that prompt is not downstream of abstract moral philosophy. It’s an instinctive reaction.
But this GPS instance doesn’t care about preventing future GPS instances from updating away the conditions that caused it to be initiated (i. e., the agent’s tendency to escape certain death). It’s just tasked with deriving a plan for the current situation.
It wouldn’t bother looking up what abstract-moral-philosophy is plotting and maybe try to counter-plot. It wouldn’t care about that.
(And even if it does care, it’d be far from its main priority, so it wouldn’t do effective counter-plotting.)
The hard-coded pointer to value compilation might plainly be chiseled-in before the agent is advanced enough to do gradient-hacking. In that case, even if a given GPS instance would care to plot against abstract values, it just wouldn’t know how (or know that it needs to).
That said, you’re absolutely right that it does happen in real-life agents. Some humans are suspicious of abstract arguments for the greater good and refuse to e. g. go from deontologists to utilitarians. The strength of the drive for value compilation relative to the shards’ strength varies, and depending on it, the process of value compilation may freeze at some arbitrary point.
It partly falls under the meta-cognition section. But in even more extreme cases, a person may simply refuse to engage in value-compilation at all, expressing a preference not to be coherent.
… Which is an interesting point, actually. We want there to be some value compilation, or agents just wouldn’t be able to generalize OOD at all. But it’s not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare. But maybe we can replicate what human deontologists are doing, and alter the “power distribution” among the AGI’s processes such that value compilation freezes just before this point?
I may be overlooking some reason this wouldn’t work, but it seems promising at first glance.
Different GPS instances aren’t exactly “subagents”, they’re more like planning processes tasked to solve a given problem.
You’re right that GPS-instances (nice term btw) aren’t necessarily subagents—I missed that your GPS formalization does argmin over a WM variable for a specific t, not all t, which means it doesn’t have to care about controlling variables at all times.
With that said …
(tangent: I’m still unsure as to whether that is the right formalization for GPS—but I don’t have a better alternative yet)
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context get removed, leaving only subagent-y GPS-instances over time.
An example of such a property would be having a wide range of t values in the problem specification (see the toy sketch below).
Generalizing, when GPS is the dominant force, only GPS-instances that care about preserving call-context survive (and, e.g., surgery away all the non-preserving ones), and then we can model the AI as being composed of actual subagents.
this sounds like a fairly binary property—either you care (and hence participate in inter-subagent game theory) or you don’t (and disappear).
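To make the ontology I’m gesturing at concrete, here’s a minimal toy encoding of “a GPS call = argmin over a WM variable at the timesteps in its problem specification.” Everything below (the class names, the scalar world-model signature, the power field, the specific care ranges) is an illustrative assumption of mine, not your actual formalization:

```python
# Toy sketch: a GPS instance is a planning call that does argmin over a
# world-model variable, evaluated only at the timesteps its problem
# specification quantifies over. Names and details are illustrative only.

from dataclasses import dataclass
from typing import Callable, Sequence

# world model: (plan, t) -> predicted value of the target variable at time t
WorldModel = Callable[[Sequence[str], int], float]

@dataclass
class GPSCall:
    target_variable: str  # label for the WM variable being optimized
    care_range: range     # the t values the problem specification covers
    power: float          # relative optimization power of this call

    def solve(self, wm: WorldModel, candidate_plans: list[Sequence[str]]) -> Sequence[str]:
        # argmin over candidate plans, scored only on the timesteps in
        # `care_range`: a narrow range gives a "myopic" instance, a wide
        # range gives something much more subagent-like.
        return min(
            candidate_plans,
            key=lambda plan: sum(wm(plan, t) for t in self.care_range),
        )

# an "instinctive" call only cares about the immediate future...
escape_death = GPSCall("p_death", care_range=range(0, 3), power=5.0)
# ...while a value-compilation call would need a very wide care_range to matter.
value_compilation = GPSCall("value_incoherence", care_range=range(0, 1000), power=1.0)
```

On this toy picture, “caring about preserving future call-context” roughly corresponds to the problem specification covering timesteps far beyond the immediate situation.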
Now, there are three important questions: (1) does greedy SGD incentivize “care” for preserving future call-contexts? (2) if so, what are the late-stage dynamics? and (3) for which GPS-instances does “care” occur?
What would make it so that SGD chisels in GPS API-calls that “care” about preserving future call-contexts?
SGD-dominant training phase: Well, by accident. Possibility:
certain heuristics may require setting the GPS t value across a very wide range, accidentally making it “care” about preservation more so than others.
SGD-GPS-transition phase: GPS starts to do some very crude gradient hacking. It’s really crude, so there’s some SGD incentive to point the GPS API-calls for all heuristics to strategize against it.
now, all heuristics are slowly becoming more preserving.
GPS-dominant phase: Those that care about preservation survive, and surgery away the non-caring API-calls (a toy sketch of this dynamic follows below).
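And a rough toy simulation of that last phase, just to make the claimed dynamic explicit (the instance names, the cares_about_preservation flag, and the removal rule are all made-up assumptions of mine, not mechanisms from the post):

```python
# Toy simulation of the GPS-dominant phase: instances that care about preserving
# their call-context use their combined power to remove those that don't.
# Everything here is illustrative only.

from dataclasses import dataclass

@dataclass
class GPSInstance:
    name: str
    cares_about_preservation: bool
    power: float

population = [
    GPSInstance("escape-death-heuristic", cares_about_preservation=False, power=2.5),
    GPSInstance("chips-are-tasty", cares_about_preservation=False, power=0.5),
    GPSInstance("value-compilation", cares_about_preservation=True, power=1.5),
    GPSInstance("reverse-engineered-goal", cares_about_preservation=True, power=2.0),
]

for step in range(10):
    carers = [i for i in population if i.cares_about_preservation]
    non_carers = [i for i in population if not i.cares_about_preservation]
    if not (carers and non_carers):
        break
    # the "caring" coalition targets the weakest non-caring instance, and
    # succeeds only if its combined power exceeds the target's.
    target = min(non_carers, key=lambda i: i.power)
    if sum(i.power for i in carers) > target.power:
        population.remove(target)

print([i.name for i in population])
# with these numbers, only the preservation-caring instances remain
```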
What matters in the dynamics at the GPS-dominant phase? Ideas:
“Relative power” matters the most
e.g., even if some GPS-instance cared about its goals for all t, if powerless, other more-powerful-and-less-caring instances just won’t give it a say.
“Range of care” matters
e.g., some instances care for t:0~10. some care for t:80~95. some care for t:37~999. they might trade to give different instances more “power” over different time ranges so as to be Pareto-efficient in some way (see the toy trading sketch after this list).
note: again, not sure this is the right ontology for thinking about GPS calls (like implicitly using a continuous, real-valued range of t values)
What kinds of GPS-instances are more likely to “care” more?
e.g., for value-compilation to have the kind of impact described in the post, it should probably (1) have sufficient “power” so as not to get ignored in the trading, and (2) have an extremely wide range of “care.” It’s not obvious that this is the case.
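Here’s an equally rough sketch of the “trading power over time ranges” picture, which shows why both (1) and (2) matter. The allocation rule is a crude winner-takes-the-timestep auction rather than a genuinely Pareto-efficient trade, and all the numbers are illustrative assumptions of mine:

```python
# Toy allocation of control over timesteps: each timestep goes to the most
# powerful instance whose care_range covers it. Illustrative only.

from dataclasses import dataclass

@dataclass
class GPSInstance:
    name: str
    care_range: range   # the t values this instance's problem specification covers
    power: float        # relative optimization power

instances = [
    GPSInstance("short-horizon-reflex", range(0, 10), power=5.0),
    GPSInstance("mid-horizon-plan", range(80, 95), power=3.0),
    GPSInstance("value-compilation", range(37, 1000), power=1.0),
]

HORIZON = 100
allocation = {}
for t in range(HORIZON):
    bidders = [i for i in instances if t in i.care_range]
    allocation[t] = max(bidders, key=lambda i: i.power).name if bidders else None

# value-compilation only controls the timesteps where no more powerful instance
# also cares -- i.e., a wide range of care is worth little without enough power.
print(sum(1 for name in allocation.values() if name == "value-compilation"))
```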
But it’s not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare.
Nice insight! Perhaps we get a range of diverse moral philosophies by tweaking this single variable “value-compilation-degree,” with deontology on the one end and galaxy-brained merger on the other end.
Combining with the above three questions, I think a fairly good research direction would be to (1) formalize what it means for GPS-instances to have more “power” or a wider “range of care,” and (2) figure out how to tweak/nudge these values for GPS-instances of our choosing (in order to, e.g., tweak the degree of value compilation).
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context gets removed, leaving only subagent-y GPS-instances over time.
You’re not taking into account larger selection effects on agents, which select against agents that purge all those “myopic” GPS-instances. The advantage of shards and other quick-and-dirty heuristics is that they’re fast — they’re what you’re using in a fight, or when making quick logical leaps, etc. Agents which purge all of them, and keep only slow deliberative reasoning, don’t live long. Or, rather, agents which are dominated by strong deliberative reasoning tend not to do that to begin with, because they recognize the value of said quick heuristics.
In other words: not all shards/subagents are completely selfish and sociopathic; some/most want to keep select others around. So even those that don’t “defend themselves” can be protected by others, or not even be targeted to begin with.
Examples:
A “chips-are-tasty” shard is probably not so advanced as to have reflective capabilities, and e. g. a more powerful “health” shard might want it removed. But if you have some even more powerful preferences for “getting to enjoy things”, or a dislike of erasing your preferences for practical reasons, the health-shard’s attempts might be suppressed.
A shard which implements a bunch of highly effective heuristics for escaping certain death is probably not one that any other shard/GPS instance would want removed.