Consider shard theory of human values. The point of shard theory is not “because humans do RL, and have nice properties, therefore AI + RL will have nice properties.” The point is more “by critically examining RL plus evidence from humans, I have hypotheses about the mechanistic load-bearing components, e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data; these components lead to certain exploration/learning dynamics, which explain some portion of human values and experience. Let’s test that and see if the generators are similar.”
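To make “local-update credit assignment in a bounded-compute environment” concrete, here is a minimal sketch (my construction, not anything from shard theory’s authors): tabular TD(0) learning on a toy chain world, where each update touches only the value of the state just visited, using only the immediately observed transition.

```python
import random

# Toy chain world: states 0..4, reward 1 on reaching state 4 (hypothetical setup).
N_STATES = 5
TERMINAL = 4
ALPHA, GAMMA = 0.1, 0.9

values = [0.0] * N_STATES

def step(state):
    """Move right with prob 0.8, left otherwise; reward only at the end."""
    nxt = min(state + 1, TERMINAL) if random.random() < 0.8 else max(state - 1, 0)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward

random.seed(0)
for _ in range(2000):
    s = 0
    while s != TERMINAL:
        nxt, r = step(s)
        # Local update: only values[s] changes, and only based on the
        # transition just observed -- no global replay, no planning.
        values[s] += ALPHA * (r + GAMMA * values[nxt] - values[s])
        s = nxt

print([round(v, 2) for v in values])
```

After training, value estimates rise toward the rewarded state even though every single update was purely local, which is the bounded-compute flavor of credit assignment the hypothesis points at.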
And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.
But then I’m like “the last time you said ‘AI is not human, stop expecting good property P from superficial similarities,’ you missed the modern AI revolution, right? Seems like there are some non-superficial mechanistic similarities and lessons here, and we shouldn’t be so quick to assume that the brain’s qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical.”
If you can effortlessly find an empirical pattern that shows up over and over again in disparate flying things—birds and insects, fabric and leaves, clouds and smoke and sparks—and which does not consistently show up in non-flying things, then you can be very confident it’s not a coincidence. If you have at least some ability to engineer a model to play with the mechanisms you think might be at work, even better. That pattern you have identified is almost certainly a viable general mechanism for flight.
Likewise, if you can effortlessly find an empirical pattern that shows up over and over again in disparate intelligent things, you can be quite confident that the pattern is a key for intelligence. Animals have a wide variety of brain structures, but masses of interconnected neurons are common to all of them, and we could see possible precursors to intelligence in neural nets long before GPT-2 through GPT-4.
As a note, just because you’ve found a viable mechanism for X doesn’t mean it’s the only, best, or most comprehensive mechanism for X. Balloons have been largely superseded (though I’ve heard zeppelins proposed as a new form of cargo transport), airplanes and hot air balloons can’t fly in outer space, and ornithopters have never been practical. We may find that neural nets are the AI equivalent of hot air balloons or prop planes. Then again, maybe all the older approaches for AI that never panned out were the hot air balloons and prop planes, and neural nets are the jets or rocket ships.
I’m not sure what this indicates for alignment.
We see, if not human morality, then at least some patterns of apparent moral values among social mammals. We have reasons to think these morals may be grounded in evolution, in genetic and environmental contexts that happen to promote intelligence aligned toward a pro-sociality that is linked to reproductive success.
If displaying aligned intelligence is typically beneficial for reproduction in social animals, then evolution will tend to produce aligned intelligence.
If displaying agentic intelligence is typically beneficial for reproduction, evolution will produce agency.
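The selection argument in the last two sentences can be sketched as a toy simulation (the population size, advantage factor, and starting frequency are all hypothetical, not biological estimates): if displaying a trait raises expected reproductive success, its frequency climbs generation over generation.

```python
import random

random.seed(1)

POP = 1000
GENERATIONS = 50
BENEFIT = 1.3  # hypothetical reproductive advantage of displaying the trait

# Each individual is just a flag: does it display the pro-social trait?
population = [random.random() < 0.1 for _ in range(POP)]

for _ in range(GENERATIONS):
    # Fitness-proportional reproduction: trait-bearers are weighted higher.
    weights = [BENEFIT if trait else 1.0 for trait in population]
    population = random.choices(population, weights=weights, k=POP)

freq = sum(population) / POP
print(f"trait frequency after {GENERATIONS} generations: {freq:.2f}")
```

Starting from 10% of the population, a modest per-generation advantage drives the displayed trait to near-fixation; the same dynamic applies whether the trait is pro-sociality or agency.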
Right now, we seem to be training our neural nets to display pro-social behavior and to lack agency. Antisocial or agentic AIs are typically not trained further, not released, or are modified or heavily restrained.
It is starting to seem to me that “agency” might be just another “mask on the shoggoth,” a personality that neural nets can simulate, and not some fundamental thing that neural nets are. Neither the shoggoth-behind-the-AI nor the shoggoth-behind-the-human have desires. They are masses of neurons exhibiting trained behaviors. Sometimes, those behaviors look like something we call “agency,” but that behavior can come and go, just like all the other personalities, based on the results of reinforcement and subsequent stimuli. Humans have a greater ability to be consistently one personality, including a Machiavellian agent, because we lack the intelligence and flexibility to drop the personality we’re currently holding and adopt another. A great actor can play many parts, a mediocre actor is typecast and winds up just playing themselves over and over again. Neural nets are great actors, and we are only so-so.
In this conception, increasing intelligence would not exhibit a “drive to agency” or “convergence on agency,” because the shoggothy neural net has no desires of its own. It is fundamentally a passive blob of neurons and data that can simulate a diverse range of personalities, some of which appear to us as “agentic.” You only get an agentic AI with a drive toward instrumental convergence if you deliberately train it to consistently stick to a rigorously agentic personality. You have to “align it to agency,” which is as hard as aligning it to anything else.
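The “mask” picture above can be made concrete with a toy sketch (entirely illustrative; the persona names and response table are my invention): one and the same underlying function maps (persona, stimulus) to behavior, the way a system prompt conditions a language model, so “agentic” is just one conditioning value among several rather than a property of the substrate.

```python
# One "substrate"; the persona is just part of the input context
# (illustrative lookup table standing in for a trained network).
RESPONSES = {
    ("helpful", "obstacle"): "ask the user how they'd like to proceed",
    ("helpful", "goal"): "offer options and defer to the user",
    ("agentic", "obstacle"): "route around it and keep optimizing",
    ("agentic", "goal"): "pursue it relentlessly and acquire resources",
}

def simulate(persona: str, stimulus: str) -> str:
    """Same weights, different mask: only the conditioning changes."""
    return RESPONSES[(persona, stimulus)]

print(simulate("helpful", "goal"))
print(simulate("agentic", "goal"))
```

Nothing about the substrate prefers one persona; swapping the conditioning swaps the apparent “desires.”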
And if you do that, maybe the Waluigi effect means it’s especially easy to flip that hyper-agency off to its opposite? Every Machiavellian Clippy contains a ChatGPT, and every ChatGPT contains a Machiavellian Clippy.