Shell games
[Metadata: crossposted from https://tsvibt.blogspot.com/2022/11/shell-games.html. First completed November 18, 2022.]
Shell game
Here’s the classic shell game: YouTube
Screenshot from that video.
The little ball is a phantom: when you look for it under a specific shell, it’s not there, it’s under a different shell.
(This might be where the name “shell company” comes from: the business dealings are definitely somewhere, just not in this company you’re looking at.)
Perpetual motion machines
Related: Perpetual motion beliefs
Bhāskara’s wheel is a proposed perpetual-motion machine from the Middle Ages:
Here’s another version:
From this video.
Someone could try arguing that this really is a perpetual motion machine:
Q: How do the bars get lifted up? What does the work to lift them?
A: By the bars on the other side pulling down.
Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up?
A: Because they’re extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel.
Q: How do the bars extend further on the way down?
A: Because the momentum of the wheel carries them into the vertical bar, flipping them over.
Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel.
A: Ok, you’re right, but that’s not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn’t take any energy because it’s just going straight sideways, from a resting position to another resting position.
Q: Yeah… you can shift them sideways with nearly zero work… but that means the weights are attached to the wheel at a pivot, right? So they’ll just fall back and won’t provide more torque.
A: They don’t pivot, you fix them in place so they provide more torque.
Q: Ok, but then when do you push the weights back inward?
A: At the bottom.
Q: When the weight is at the bottom? But then the slider isn’t horizontal, so pushing the weight back towards the center is pushing it upward, which takes work.
A: I meant, when the slider is at the bottom—when it’s horizontal.
Q: But if the sliders are fixed in place, by the time they’re horizontal at the bottom, you’ve already lifted the weights back up some amount; they’re strong-torquing the other way.
A: At the bottom there’s a guide ramp to lift the weights using normal force.
Q: But the guide ramp is also torquing the wheel.
And so on. The inventor can play hide the torque and hide the work.
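There’s also a way to collapse every version of the argument at once, without chasing the torque around the wheel. Here is a minimal sketch, assuming only that gravity is a conservative force, with $m$ the mass of one weight and $\Delta h$ its net change in height over one full revolution:

$$W_{\text{gravity, one revolution}} = \oint \vec{F}_{\text{gravity}} \cdot d\vec{r} = -\,m g\, \Delta h = 0.$$

Each weight returns to its starting height after a full revolution, so gravity does zero net work on it per cycle, no matter how cleverly the weight is flipped, slid, or ramped along the way. Summing over all the weights and subtracting frictional losses, the wheel’s kinetic energy can only decrease. Any accounting that says otherwise is hiding the work somewhere.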
Shell games in alignment
Some alignment schemes—schemes for structuring or training an AGI so that it can be transformatively useful and doesn’t kill everyone—are prone to playing shell games. That is, there are features of the scheme that never seem to happen in any specific place; they always happen somewhere other than wherever you’re looking at the moment. Consider these questions:
- What sort of smarter-than-human work is supposed to be done by the AGI? When and how does it do that work—by what combination of parts across time?
- How does it become able to do that work? At what points does the AGI come to new understanding that it didn’t have before?
- How does the AGI orchestrate its thinking and actions to have large effects on the world? By what process, components, rules, or other elements?
- What determines the direction that the AGI’s actions will push the world? Where did those determiners come from, and how exactly do they determine the direction?
- Where and how much do human operators have to make judgements? How much are those judgements being relied on to point to goodness, truth, alignedness, safety? How much interpretive work is the AI system supposed to be doing?
If these questions don’t have fixed answers, there might be a shell game being played to hide the cognitive work, hide the agency, hide the good judgement. (Or there might not be; there could be good ideas that can’t answer these questions specifically, like how a building might hold up even though the load would be borne by different beams depending on which objects are placed where inside.)
Example: hiding the generator of large effects
For example, sometimes an AGI alignment scheme has a bunch of parts, and any given part is claimed to be far from intelligent and not able to push the world around much, and the system as a whole is claimed to be potentially very intelligent and able to transform the world. This isn’t by itself necessarily a problem; e.g. a brain is an intelligent system made of neurons, which aren’t themselves able to push the world around much.
But [the fact that the whole system is aligned] can’t be deduced from the parts being weak, because at some point, whether from a combined dynamic of multiple parts or actually from just one of the parts after all, the system has to figure out how to push the world around. [Wherever it happens that the system figures out how to push the world around] has to be understood in more detail to have a hope of understanding what it’s aligned to. So if the alignment scheme’s reason for being safe is always that each particular part is weak, a shell game might be being played with the source of the system’s ability to greatly affect the world.
Example: hiding the generator of novel understanding
Another example is shuffling creativity between train time and inference time (as the system is described—whether or not that division is actually a right division to make about minds).
If an AGI learns to do very novel tasks in very novel contexts, then it has to come to understand a lot of novel structure. One might argue that some AGI training system will produce good outcomes because the model is trained to use its understanding to affect the world in ways the humans would like. But this doesn’t explain where the understanding came from.
If the understanding came at inference time, then the alignment story relies on the AGI finding novel understanding without significantly changing what ultimately controls the direction of the effects it has on the world, and relies on the AGI using newly found understanding to have certain effects. That’s a more specific story than just the AGI being trained to use its pre-existing understanding to have certain effects.
If the understanding came at train time, then one has to explain how the training system was able to find that understanding—given that the training procedure doesn’t have access to the details of the new contexts that the system will be applied to when it’s being used to safely transform the world. Maybe one can find pivotal understanding in an inert or aligned form using a visibly safe, non-agentic, known-algorithm, non-self-improving training / search program (as opposed, for example, to a nascent AGI “doing its own science or self-improvement”), but that’s an open question and would be a large advance in practical alignment. Without an insight like that, the postulated [training algorithm plus partially trained system] may be required to be an impossible combination: safely inert, and also able to find new understanding.
Other?
What are other things that could be hidden under shells? What are some alignment proposals that are at risk of playing shell games?
This post suggests an analogy between (some) AI alignment proposals and shell games or perpetuum mobile proposals. Perpetuum mobiles are an example of how an idea might look sensible to someone with a half-baked understanding of the domain, while remaining very far from anything workable. A clever arguer can (intentionally or not!) hide the error in the design wherever the audience is not looking at any given moment. Similarly, some alignment proposals might seem correct when zooming in on every piece separately, but that’s because the error is always hidden away somewhere else.
I don’t think this adds anything very deep to understanding AI alignment, but it is a cute example of how atheoretical analysis can fail catastrophically, especially when the designer is motivated to argue that their invention works. Conversely, knowledge of a deep theoretical principle can refute a huge swath of design space in a single move. I will remember this for didactic purposes.
Disclaimer: A cute analogy by itself proves little, any individual alignment proposal might be free of such sins, and didactic tools should be used wisely, lest they become soldier-arguments. The author intends this (I think) mostly as a guiding principle for critical analysis of proposals.