Back-of-the-envelope probability estimate of alignment-by-default via a certain shard-theoretic pathway. The following is what I said in a conversation discussing the plausibility of a proto-AGI picking up a “care about people” shard from the data, and retaining that value even through reflection. I was pushing back against a sentiment like “it’s totally improbable, from our current uncertainty, for AIs to retain caring-about-people shards. This is only one story among billions.”
Here’s some of what I had to say:
[Let’s reconsider the five-step mechanistic story I made up.] I’d give the following conditional probabilities (made up with about 5 seconds of thought each):
1. Humans in fact care about other humans, in a way which extrapolates to quasi-humans still being around (whatever that means). P(1) = .85
2. Human-generated data makes up a large portion of the corpus, and having a correct model of humans is important for “achieving low loss”,[1] so the AI has a model of how people want things. P(2 | 1) = .6; it could have different abstractions, or it could have learned these models later in training, once key decision-influences are already there
3. During RL finetuning, and given this post-unsupervised initialization, there’s now an inductive bias towards just hooking human-like criteria for bidding on internal-AI-plans. I.e., humans give approval-based reinforcement, and an inductively easy way of upweighting logits on those actions is just to hook up the human-like plan-criteria into the AI’s planning process, so the AI gets a humanlike “care about people” shard. P(3 | 2, 1) = .55; due to plurality of value, I expect this to be one of the ways it learns to make decisions
4. The AI bids on plans in substantial part on the basis of these criteria, including when it navigates reflection. P(4 | 3, 2, 1) = .25: 50% that it’s a strong shard and 50% that it survives various reflection difficulties; part of the latter estimate comes from intuitions around formed-by-positive-reward shards seeming more likely to be reflectively endorsed
5. Therefore, the AI is bidding on plans where it keeps making decisions on the basis of “care about people”, by instrumental convergence (goal preservation). P(5 | 4, 3, 2, 1) = .95; given that it already reflectively endorses this shard, I think the standard instrumental convergence arguments apply
Conclusion: The AI has reflected and still cares about people. Overall estimate: .67%
My estimate here came out a biiit lower than I had expected (~1%), but it also is (by my estimation) far more probable than most of the billions of possible 5-step claims about how the final cognition ends up. I think it’s reasonable to expect there to be about 5 or 6 stories like this from similar causes, which would make it not crazy to have ~3% on something like this happening given the amount of alignment effort described (i.e. pretty low-effort).
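(Roughly: if those stories were more or less disjoint, five or six chances at ~.67% each would add up to something like 3 to 4%.)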
That said, I’m wary of putting 20% on this class of story, and after running these numbers I’m a little more leery of 10% as well.
(I suspect that the alter-Alex who had socially wanted to argue the other side, that the probability was low, would have come out to about .2% or .3%. For a few of the items, I tried pretending that I was instead slightly emotionally invested in the argument going the other way, and hopefully that helped my estimates be less biased. I wouldn’t be surprised if some of these numbers are a bit higher than I’d endorse from a perfectly neutral standpoint.)
(I also don’t have strong faith in my ability to deal with heavily conjunctive scenarios like this; I expect I could be made to give lower numbers for event A if it were described as ‘A happens in 5 steps’ rather than ‘A happens in 3 steps’.)
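(As a toy illustration of that effect: if each step were assigned .8, a five-step telling multiplies out to .8^5 ≈ .33, while a three-step telling of the same event would get .8^3 ≈ .51, purely because of how the story was carved up.)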
[1] This is shorthand for extra reasoning which would have to be done; “low loss” is not a very good reason to expect anything to happen, IMO. But I wanted to box that reasoning for this discussion. I think there’s probably a more complicated and elegant mapping by which a training corpus leaves its fingerprints in the trained AI’s concepts.
This seems like an underestimate, because you don’t consider whether the first “AGI” will indeed make it so we only get one chance. If it can only self-improve by more gradient steps, then humanity has a greater chance than if it self-improves by prompt engineering or by direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.
What does self-improvement via gradients vs. prompt-engineering vs. direct modifications have to do with how many chances we get? I guess we have at least a modicum more control over the gradient feedback loop than over the other loops?
Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.

Can you say more?
During RL finetuning, and given this post-unsupervised initialization, there’s now an inductive bias towards just hooking human-like criteria for bidding on internal-AI-plans. I.e., humans give approval-based reinforcement, and an inductively easy way of upweighting logits on those actions is just to hook up the human-like plan-criteria into the AI’s planning process, so the AI gets a humanlike “care about people” shard. P(3 | 2, 1) = .55; due to plurality of value, I expect this to be one of the ways it learns to make decisions
This is where I’d put a significantly lower probability. Could you elaborate on why there’s an inductive bias towards “just hooking human-like criteria for bidding on internal-AI-plans”? As far as I can tell, the inductive bias for human-like values would be something that at least seems closer to the human-brain structure than any arbitrary ML architecture we have right now. Rewarding a system for better modeling human beings’ desires doesn’t seem to me to lead it towards having similar desires. I’d use the “instrumental versus terminal desires” concept here, but I expect you would consider that something that adds confusion instead of removing it.
Because it’s a shorter edit distance in its internal ontology; it’s plausibly NN-simple to take existing plan-grading procedures, internal to the model, and hook those more directly into its logit-controllers.
Also note that it probably internally hooks up lots of ways to make decisions, and this only has to be one (substantial) component. Possibly I’d now put .3 or .45 instead of .55, though.
0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
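For concreteness, here is a minimal sketch of that product in Python, using the five conditional probabilities as listed in the post (the step labels are just informal shorthand for the five claims):

```python
# Conditional probabilities from the five-step story, as listed in the post.
step_probs = [
    ("1. humans care about other humans", 0.85),                 # P(1)
    ("2. the AI models how people want things", 0.60),           # P(2 | 1)
    ("3. human-like plan-criteria hooked into planning", 0.55),  # P(3 | 2, 1)
    ("4. the shard is strong and survives reflection", 0.25),    # P(4 | 3, 2, 1)
    ("5. instrumental goal preservation applies", 0.95),         # P(5 | 4, 3, 2, 1)
]

# Chain rule: P(1,2,3,4,5) = P(1) * P(2|1) * P(3|2,1) * P(4|3,2,1) * P(5|4,3,2,1)
total = 1.0
for label, p in step_probs:
    total *= p

print(f"{total:.4f}")  # 0.0666, i.e. about 6.7% rather than .67%
```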