I have a question about “AGI Ruin: A List of Lethalities”.

These two sentences from Section B.2 stuck out to me as the most important in the post:
...outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
My question is: supposing this is all true, what is the probability of failure of inner alignment? Is it 0.01%, 99.99%, 50%...? And how do we know how likely failure is?
It seems like there is a gulf between “it’s not guaranteed to work” and “it’s almost certain to fail”.
I think any reasonable estimate would have to be based on a more detailed plan: what type of reward (loss function) we are providing, and what type of inner alignment we want.
My intuition roughly aligns with Eliezer’s on this point: I doubt this will work.
When I imagine rewarding an agent for doing things humans like, as indicated by smiles, thanks, etc., I have a hard time imagining that this just generalizes to an agent that does what we want, even in very different circumstances, including when it can relatively easily gain sovereignty and do whatever it wants.
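To make that worry concrete, here is a minimal toy sketch (my own illustration, not anything from the post, from Eliezer, or from shard theory): a learner whose only training signal is the observable “smile” proxy, evaluated after the proxy and genuine helpfulness come apart. All names and numbers are made up, and this is a cartoon of the concern, not a claim about how deep RL systems actually generalize.

```python
import numpy as np

# Candidate actions are described by two binary features:
# [produces_smile, actually_helps].
# In training, smiles and genuine help usually (but not always) coincide.
train_actions = np.array(
    [[1, 1]] * 450 +   # help that also earned a smile/thanks
    [[1, 0]] * 50 +    # flattery: a smile with no real help
    [[0, 1]] * 50 +    # help that went unthanked
    [[0, 0]] * 450,    # neither
    dtype=float,
)

# The outer training signal is only the observable proxy: the smile.
train_reward = train_actions[:, 0]

# Fit a linear value estimate over the features (ordinary least squares).
weights, *_ = np.linalg.lstsq(train_actions, train_reward, rcond=None)

# Deployment: the proxy and the intended goal come apart completely.
flatter_only = np.array([1.0, 0.0])  # produces a smile, doesn't help
help_only = np.array([0.0, 1.0])     # helps, earns no smile

print("learned weights:", weights)                      # ~[1, 0]
print("value(flatter only):", flatter_only @ weights)   # ~1.0
print("value(help only):   ", help_only @ weights)      # ~0.0
# Because the training signal only ever referenced the proxy, the fitted
# values track smiles; once smiles and genuine help diverge, the learned
# preference follows the smiles.
```

The toy is deliberately rigged, of course; the open question in this thread is how much real inductive biases change the picture.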
Others have a different intuition. In a comment somewhere, Quintin Pope says something to the effect of “shard theory isn’t a new theory of alignment; it’s the hypothesis that we don’t need one”. I think he and other shard theory optimists consider it entirely plausible that rewarding stuff we like will develop inner representations and alignment adequate for our purposes.
While I share Eliezer’s and others’ pessimism about alignment through pure RL, I don’t share his overall pessimism. You’ve seen my alternate proposals for directly setting desirable goals out of an agent’s learned knowledge.
Inner-alignment failure is almost certain in the narrow technical sense of “some difference, no matter how small”, and unknown (and currently undefinable) in any more useful sense.