Apologies if this has been said, but the reading level of this essay is stunningly high. I’ve read Rationality: A-Z and I can barely follow some of the passages. For example:
This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
I think what Yud means here is that our genes had a base objective of reproducing themselves. The genes “wanted” their humans to make babies that were also reproductively fit. But the “real-world bounded optimization process” produced humans that sought different things, like sexual pleasure, food, and alliances with powerful peers. In the early environment that worked, because sex led to babies, food led to healthy babies, and alliances led to protection for the babies. But once we built civilization, we started having sex with birth control as an end in itself, even letting it distract us from the baby-making objective. So the genes had this goal, but the mesa-optimizer (humans) was only aligned in one environment. When the environment changed, it lost alignment. We can expect the same to happen to our AI.
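To check my own understanding, here is a minimal toy sketch of that argument in Python. This is entirely my own illustration, with made-up names like `sex_drive` and `contraception`, not anything from the post: an outer search scores candidates on the base objective (offspring) in one environment, and the agent it finds acts on an internal drive that only happens to correlate with that objective there.

```python
# Toy sketch (my own illustration): selection on the base objective in one
# environment produces an agent whose internal drive decouples from that
# objective when the environment changes.
import random

random.seed(0)

ancestral = {"contraception": 0.0}   # sex reliably leads to babies
modern    = {"contraception": 0.95}  # mostly it doesn't

def behave(agent, env):
    # The agent acts on its internal drive, not on the base objective.
    sex_acts = agent["sex_drive"]
    babies = sex_acts * (1.0 - env["contraception"])
    return sex_acts, babies

def fitness(agent, env):
    # Outer (base) objective: number of offspring.
    _, babies = behave(agent, env)
    return babies

# "Evolution": hill-climb on fitness, but only ever evaluated in the
# ancestral environment, so selecting for a strong sex drive works as a proxy.
agent = {"sex_drive": 0.1}
for _ in range(1000):
    candidate = {"sex_drive": max(0.0, agent["sex_drive"] + random.uniform(-0.05, 0.05))}
    if fitness(candidate, ancestral) > fitness(agent, ancestral):
        agent = candidate

print("ancestral:", behave(agent, ancestral))  # lots of sex, lots of babies
print("modern:   ", behave(agent, modern))     # lots of sex, almost no babies
```

The search only ever "sees" the ancestral environment, so it has no way to distinguish "wants babies" from "wants sex"; it finds the latter, which then keeps optimizing its own drive after the distribution shift.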
Okay, I think I get it. But there are so few people on the planet who can parse this passage.
Has someone written a more accessible version of this yet?
Your summary sounds good to me. https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers?s=r might be a good source for explaining some of the terms like “inner-aligned”?