Any goal specification implies a cluster of target states that in turn can be taken as a measure. Given that, it seems that any goal specification short of a complete extrapolation of the goal-setter’s volition is subject to Goodhart’s Law. If so, we should expect everything short of direct action by the goal-setter to be goodharted all the time.
From this perspective, it seems like a miracle that any large-scale action works at all. How do we do it?
My guess: goal specification is used to disambiguate communication rather than for optimization. The agents have some relatively fixed modes of “what they do” and the principal’s will is “only” used to pick one of those. If the peak of the principal’s will falls “in between” options, so sad, it will be rounded off to the nearest “supported” option. This way agents take a sensible action even if used by multiple principals who would like to “customise” their behaviour in weird ways, each of course in a different manner. Clunky but uncorrupted. You don’t have to watch out what you wish for, because you don’t get what you wish for.
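A toy sketch of that rounding-off, in case it helps. Everything in it is made up for illustration: the function name `nearest_supported`, the one-dimensional “wish” and the three-option repertoire are my assumptions, not a model of any real agent.

```python
# Hypothetical sketch: the repertoire and all numbers are invented.
from typing import Sequence


def nearest_supported(wish: float, repertoire: Sequence[float]) -> float:
    """Return the supported option closest to the principal's wish."""
    return min(repertoire, key=lambda option: abs(option - wish))


# The agent's fixed modes of "what it does".
REPERTOIRE = [0.0, 5.0, 10.0]

# Two principals with idiosyncratic wishes each get a stock action.
print(nearest_supported(6.2, REPERTOIRE))  # 5.0
print(nearest_supported(9.1, REPERTOIRE))  # 10.0

# An extreme wish cannot push the agent outside its repertoire.
print(nearest_supported(1e9, REPERTOIRE))  # 10.0
```

The point of the sketch is only that the wish selects among existing modes; nothing in the agent climbs the wish as an objective, so an extreme specification has nowhere to push.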
Hmm. In order to avoid goodharting, the composite agent should be structured such that its actions emerge coherently from the agencies of the subagents. Any top-down framework I can think of is a no-go and looking at how the subagents get their own agencies hints at infinite regress. My brain hurts (in a good way).
What specification the principal gives to the agent does not need to be the principal’s will, nor what ends up happening with the composite agent. In order to get goodharting there needs to be optimization, and while goal specifications can be used for optimization, the involvement of a specification doesn’t necessitate that optimization happens. For example, if the agent tried to be maximally harmful to the principal, then the principal could give a “reverse psychology” goal specification, which is not a representation the principal would use for themselves but gets the desired result when used in that assignment.
The tricky thing then is that the principal could hand in an assignment that triggers action vastly different from what the assignment describes, yet the end action ends up being beneficial to the composite system. That is, the principal goes “go buy me the most expensive hammer”, and the agent thinks “well, I could get a million dollar hammer, but I am just going to get a thousand dollar hammer instead of a ten dollar hammer”. The agent is influenced (a different principal could result in the agent going for the ten dollar hammer), but the assignment is significantly “unfulfilled”. A prompt that would naively read as pointing to the million dollar hammer might almost never result in the million dollar hammer. So even if the shopping-list writer “overdemands” and the shop visitor “underdelivers”, the overall system can end up taking reasonable actions (which no component specified).
Big update toward the principal serving a coordinating function only (alwayshasbeen.png).
Subagents will unavoidably operate under their own agency; any design where their agenda is fully set from above would goodhart by definition. The only scenario where there’s non-goodhart coherence seems to be where there’s some sort of alignment between the principal’s agenda and the agency of the subagents.
ETA: The subagent receives the edict of the principal and fills in the details using its own agency. The resulting actions make sense to the extent the subagent has (and uses) “common sense”.
(I upvote you writing this as a main-page question or post.)
Appreciate the suggestion. I’m noting some internal resistance partly due to scope creep. If this is a valid generalization of Goodhart, then it a) starts to suspiciously resemble the alignment problem and b) suggests that there might be something to glean about how we succeed or fail at dodging Goodhart as a society.
Cool. (FWIW, IMO questions can be short and simple. Also, yeah I think it’s related to alignment; a sort of rephrasing would be “there’s already alignment between humans, and within humans between urges and conceptual planning as well as between daily-plans and minute-plans, and so on; how does that work?”.)
AI alignment is a wicked problem. It won’t be solved by any approach that fails to grapple with how deeply it mirrors self-alignment, child alignment, institutional alignment and many others.
(copied from my tweet)
AI alignment might be an ontology problem. It seems to me that the referent of “human values” is messy and fractal exactly in the way that our default modes of inquiry can’t get a grip on. If you stick to those modes of inquiry, it’s hard to even notice.
Upgrading a primate didn’t make it strongly superintelligent relative to other primates. The upgrades made us capable of recursively improving our social networking; that was what made the difference.
If you raised a child as an ape, you’d get an ape. That we seem so different now is due to the network effects looping back and upgrading our software.
If you raise an ape as a child, you don’t get a child. You just get an ape.
Yeah, there were important changes. I’m suggesting that most of their long-term impact came from enabling the bootstrapping process. Consider the (admittedly disputed) time lag between anatomical and behavioral modernity and the further accelerations that have happened since.
ETA: If you could raise an ape as a child, that variety of ape would’ve taken off.
I suspect that you’re right, but that this isn’t as comforting as might be expected in the event of AGI.
The flexibility and capacity of human brains to learn skills and acquire knowledge have allowed us to shorten the timelines for developing skills and knowledge for members of our species by many orders of magnitude. Instincts honed by evolution take a long time to develop further. Knowledge acquired from literacy and mass distribution takes a lot less, but is still limited by the couple of decades it takes for children to start with a blank slate and learn through to adulthood so that they (hopefully) have enough skills to function in society. Some few of them can learn new things to add to and refine the collection. Then they die, only a few decades after building up to basic competence, and any skills and knowledge not explicitly passed on are lost. Still, by the standards of evolution this is lightning fast.
The development of AGI promises that some entities will be capable of learning much faster than we do, likely in months at most instead of decades. Later versions may be able to learn much faster still. They will be able to be copied in a trained state, and the copies can learn further from there without inevitable death imposing an upper bound.
This will close a completely new positive feedback loop, even without considering recursive self-improvement of underlying software or hardware. We should expect closing additional (and faster) feedback loops in knowledge and skill acquisition to lead to unpredictable capabilities, beyond anything previously seen. The likelihood of additional improvements in underlying software and hardware just adds two more new positive feedback loops in capability gain.
This may be a case of a threshold effect. I’m not sure what your definition of “strongly superintelligent relative to” is, but unlocking language and social networking and planning is definitely a superpower. And one that would be hard to predict until it started to appear.
It’s quite possible that the next jump in ability to optimize the future by application of models (intelligence) is small by some measures, but large in impact due to some capability we currently undervalue.
Edit: On reflection, in many situations insulation from financial pressures may be a good thing, all else being equal. That still leaves the question of how to keep networks in proper contact with reality. As our power increases, it becomes ever easier to insulate ourselves and spiral into self-referential loops.
If civilization really is powered by network learning on the organizational level, then we’ve been doing it exactly wrong. Top-down funding that was supposed to free institutions and companies to pursue their core competencies has the effect of removing reality-based external pressures from the organization’s network structure. It certainly seems as if our institutions have become more detached from reality over time.
Have organizations been insulated from contact with reality in other ways?
Humans in modern societies cannot be disconnected from financial pressures (or, in some communist experiences, pressures of material needs intermediated by not-quite-financial mechanisms). Pure insulation from such is unlikely to be possible at any scale, and probably not desirable for most things.
The common concern here is not “connection with reality”, but “short-timeframe focus”, along with some amount of “ability to take long-shot bets that probably won’t pay off”. I know of no way to make individuals or organizations resistant to the pressures of having to eat and of enjoying leisure activities which require other humans to be paid.
You can seek out and attract members who are naturally more aligned with your mission than with their personal well-being, but that doesn’t get you to 100%. Probably enough for most purposes. You can attract sponsors and patrons who’ve overindexed on their personal wealth and are willing to share some of it with you, in pursuit of your mission.
In all cases, if the mission becomes unpopular or if those sacrificing to further it stop believing it’s worth your expense, it’ll stop. This can take some time, and isn’t terribly well-measured, so it often affects topics rather than organizations or individuals. That sucks, but over time it definitely happens.
You seem to be focused on the individual level? I was talking about learning on the level of interpersonal relationships and up. As I explain here, I believe any network of agents does Hebbian learning on the network level by default. Sorry about the confusion.
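For concreteness, here is a minimal sketch of the plain textbook Hebbian rule (a link strengthens when both ends are active together) applied to a toy network of agents. It is not taken from the linked explanation; the function name `hebbian_step`, the learning rate and the three-agent example are all hypothetical.

```python
# Hypothetical sketch: textbook Hebbian update on a made-up agent network.
from typing import List

Matrix = List[List[float]]


def hebbian_step(weights: Matrix, activity: List[float], rate: float = 0.1) -> Matrix:
    """Return new link weights: w[i][j] grows by rate * activity[i] * activity[j]."""
    n = len(activity)
    return [
        [
            # No self-links: the diagonal is left untouched.
            weights[i][j] + (rate * activity[i] * activity[j] if i != j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three agents, no relationships yet.
weights = [[0.0] * 3 for _ in range(3)]

# Agents 0 and 1 are active in the same interaction; agent 2 sits it out.
weights = hebbian_step(weights, activity=[1.0, 1.0, 0.0])

print(weights[0][1])  # link between the two co-active agents has strengthened
print(weights[0][2])  # link to the inactive agent is unchanged
```

Note that no individual agent stores or updates the weight matrix here; the “learning” lives in which relationships get reinforced, which is the sense of network-level learning I have in mind.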
Looking at the large scale, my impression is that the observable dysfunctions correspond pretty well with pressures (or lack thereof) organizations face, which fits the group-level-network-learning view. It seems likely that the individual failings, at least in positions where they matter most, are downstream of that. Call it the institution alignment problem if you will.
I don’t think we have a handle on how to effectively influence existing networks. Forming informal networks of reasonably aligned individuals around relatively object-level purposes seems like a good idea by default.
Hmm. I don’t think I agree that network/group learning exists, distinctly from learning and expectations of the individuals. This is not a denial that higher levels of abstraction are useful for reasoning, but that doesn’t make them ontologically real or distinct from the sum of the parts.
To the extent that we can observe the lower-level components of a system, and there are few enough of them that we can identify the way they add up, we get more accurate predictions by doing so, rather than averaging them out into collective observations.
For this example, the organization “cares” about prosaic things like money, because its constituents do. It may also care about it in terms of influence on other orgs or non-constituent humans, of course.
Are you ontologically real or distinct from the sum of your parts? Do you “care” about things only because your constituents do?
I’m suggesting precisely that the group-network levels may be useful in the same sense that the human level or the multicellular-organism level can be useful. Granted, there’s more transfer and overlap when the scale difference is small but that in itself doesn’t necessarily mean that the more customary frame is equally-or-more useful for any given purpose.
Appreciate the caring-about-money point, got me thinking about how concepts and motivations/drives translate across levels. I don’t think there’s a clean joint to carve between sophisticated agents and networks-of-said-agents.
Side note: I don’t know of a widely shared paradigm of thought or language that would be well-suited for thinking or talking about tall towers of self-similar scale-free layers that have as much causal spillover between levels as living systems like to have.
Nope. Well, maybe. I’m the sum of parts in a given configuration, even as some of those parts are changed, and as the configuration evolves slightly. Not real, but very convenient to model, since my parts are too numerous and their relationships too complicated to identify individually. But I’m not any more than that sum.
I fully agree with your point that there’s no clean joint to carve between when to use different levels of abstraction for modeling behavior (and especially for modeling “caring” or motivation), but I’ll continue to argue that most organizations are small enough that it’s workable to notice the individuals involved, and you get more fidelity and understanding if you do so.