There’s apparently some controversy over what the Wright brothers were able to infer from studying birds. From Wikipedia:
On the basis of observation, Wilbur concluded that birds changed the angle of the ends of their wings to make their bodies roll right or left.[34] The brothers decided this would also be a good way for a flying machine to turn – to “bank” or “lean” into the turn just like a bird – and just like a person riding a bicycle, an experience with which they were thoroughly familiar. Equally important, they hoped this method would enable recovery when the wind tilted the machine to one side (lateral balance). They puzzled over how to achieve the same effect with man-made wings and eventually discovered wing-warping when Wilbur idly twisted a long inner-tube box at the bicycle shop.[35]
Other aeronautical investigators regarded flight as if it were not so different from surface locomotion, except the surface would be elevated. They thought in terms of a ship’s rudder for steering, while the flying machine remained essentially level in the air, as did a train or an automobile or a ship at the surface. The idea of deliberately leaning, or rolling, to one side seemed either undesirable or did not enter their thinking...
Wilbur claimed they learned about plausible control mechanisms from studying birds. However, ~40 years after their first flight, Orville would go on to contradict that assertion and claim that they weren’t able to draw any useful ideas from birds.
There are many reasons why I think shard theory is a promising line of research. I’ll just list some of them without defending them in any particular depth (that’s what the shard theory sequence is for):
I expect that there’s more convergence in the space of effective learning algorithms than there is in the space of non-learning systems. This is ultimately due to the simplicity prior, which we can apply to the space of learning systems. Those learning algorithms which are best able to generalize are those which are simple, for the same reason that simple hypotheses are more likely to generalize. I thus expect there to be more convergence between the learning dynamics of artificial and natural intelligences than most alignment researchers seem to assume.
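As a toy illustration of the “simple hypotheses generalize better” point (this is my own example, not anything from the shard theory sequence; the polynomial-fitting setup and the specific numbers are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying function.
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)

for degree in (3, 12):
    # Fit a polynomial of the given complexity to the noisy training data.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: held-out MSE = {mse:.3f}")

# The lower-degree (simpler) fit typically has lower held-out error:
# simplicity acts like a prior favoring hypotheses that generalize.
```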
The more I think about values, the less weird they seem. They do not look like a hack or a kludge to me, and it seems increasingly likely that a broad class of agentic learning systems will converge to similar value meta-dynamics. I think evolution put essentially zero effort into giving us “unusual” meta-preferences, and that the meta-preferences we do have are pretty typical in the space of possible learning systems.
To be clear, I’m not saying that AIs will naturally converge to first-order human values. I’m saying they’ll have computational structures whose higher-order dynamics are very similar to those of human values, but which could be oriented towards completely different things.
I think that imperfect value alignment does not lead to certain doom. I reject the notion that there are any “true” values which exist as ephemeral Platonic ideals, inaccessible to our normal introspective processes.
Rather, I think we have something like a continuous distribution over possible values which we could instantiate in different circumstances. Much of the felt sense of value fragility arises from a type error of trying to represent a continuous distribution with a finite set of samples from that distribution.
The consequence of this is that it’s possible for two such distributions to partially overlap. In contrast, if you think that humans have some finite set of “true” values (which we don’t know), that AIs have some finite set of “true” values (which we can’t control), and that these need to near-perfectly overlap or the AIs will Goodhart away all the future’s value, then the prospects for value alignment would look grim indeed!
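Here is a minimal numerical sketch of the contrast between comparing continuous distributions and comparing finite sample sets. The Gaussian “value distributions” and the overlap coefficient are purely illustrative choices on my part, not a claim about how values are actually represented:

```python
import numpy as np

# Toy model: treat "human values" and "AI values" as continuous densities
# over a one-dimensional value space, rather than as finite sets of values.
xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

def gaussian(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

human_values = gaussian(xs, mean=0.0, std=1.0)
ai_values = gaussian(xs, mean=1.5, std=1.2)

# Overlap coefficient: integral of the pointwise minimum of the two densities.
# It is 1 for identical distributions, 0 for disjoint ones, and anything in
# between otherwise -- partial overlap is the normal case, not an edge case.
overlap = np.sum(np.minimum(human_values, ai_values)) * dx
print(f"overlap coefficient: {overlap:.2f}")

# Two finite sets of samples, by contrast, essentially never share an exact
# element, so an exact-match comparison makes any mismatch look like total
# value divergence -- the "type error" described above.
human_samples = np.random.default_rng(0).normal(0.0, 1.0, size=10)
ai_samples = np.random.default_rng(1).normal(1.5, 1.2, size=10)
print("shared exact samples:", np.intersect1d(human_samples, ai_samples).size)
```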
I think that inner values are relatively predictable, conditional on knowing the outer optimization criteria and learning environment. Yudkowsky frequently points to how evolution failed to align humans to maximize inclusive genetic fitness, arguing that this implies inner values have no predictable relationship with outer optimization criteria. I think it’s a mistake to anchor our expectations of inner / outer value outcomes in AIs to evolution’s outcome in humans. Evidence from inner / outer value outcomes in human within-lifetime learning seems like the much more relevant comparison to me.
Similarly, I don’t think we’ll get a “sharp left turn” from AI training, so I more strongly expect that work on value aligning current AI systems will extend to superhuman AI systems, and that human-like learning dynamics will not totally go out the window once we reach superintelligence.
“Does not”?
Yes, “does not” is what I meant.