That was an inspiring and enjoyable read! Can you say why you think AUP is “pointless” for Alignment? It seems to me that eliciting cautious behavior from a reward learner might turn out to be helpful. Overall, my intuition is that it could turn out to be an essential piece of the puzzle.
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
Can you say why you think AUP is “pointless” for Alignment?
Off-the-cuff:
AUP, or any other outer objective function / reward function scheme, relies on having any understanding at all of how to transmute outer reward schedules (e.g. the AUP reward function + training; a rough sketch of that reward follows these points) into internal cognitive structures (e.g. a trained policy which reliably doesn’t take actions which destroy vases) which are stable over time (e.g. the policy “cares about” not turning into an agent which destroys vases).
And if we knew how to do that, we could probably do a lot more exciting things than impact-limited agents; we probably would have just solved a lot of alignment in one fell swoop.
I think I have some ideas of how this happens in people, and how we might do it for AGI.
Even if impact measures worked, I think we really want an AI which can perform a pivotal act, or at least something really important and helpful (Eliezer often talks about the GPU-melting; my private smallest pivotal act is not that, though).
Impact measures probably require big competitiveness hits, which twins with the above point.
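For concreteness, the AUP reward has roughly this shape (glossing over the exact normalization, which varies across the papers): for a primary reward R, auxiliary reward functions R_i, a no-op action ∅, and a penalty coefficient λ,

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{N}\sum_{i=1}^{N}\bigl|\,Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing)\,\bigr|$$

The worry above isn’t about this outer signal per se, but about whether we know how training on it produces a policy whose internal structure stably reflects it.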
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
Please go ahead and speculate anyways. Think for yourself as best you can, don’t defer to me, just mark your uncertainties!
I have said nice things about AUP in the past (in papers I wrote) and I will continue to say them. I can definitely see real-life cases where adding an AUP term to a reward function makes the resulting AI or AGI more aligned. Therefore, I see AUP as a useful and welcome tool in the AI alignment/safety toolbox. Sure, this tool alone does not solve every problem, but that hardly makes it a pointless tool.
From your off-the-cuff remarks, I am guessing that you are currently inhabiting the strange place where ‘pivotal acts’ are your preferred alignment solution. I will grant that, if you are in that place, then AUP might appear more pointless to you than it does to me.
I’m not sure what I was thinking of, but probably just this: my understanding is that “safe AGI via AUP” would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make the end goal too difficult to actually achieve when, e.g., it turns out to require tangentially related behavior.
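As a toy illustration of that worry (hypothetical numbers, assuming a penalty of roughly the form sketched earlier in the thread):

```python
# Toy illustration (hypothetical numbers): an action that is instrumentally useful
# for the end goal can still be net-penalized if it shifts the agent's attainable
# utility (Q-values) on auxiliary goals.
lam = 1.0                        # penalty coefficient
primary_gain = 0.5               # primary reward for a tangentially useful action
aux_q_shifts = [0.4, 0.8, 0.6]   # |Q_i(s, a) - Q_i(s, no-op)| for three auxiliary goals

penalty = lam * sum(aux_q_shifts) / len(aux_q_shifts)  # average shift = 0.6
shaped_reward = primary_gain - penalty                  # 0.5 - 0.6 = -0.1
print(round(shaped_reward, 3))  # negative: the shaped reward discourages the useful action
```

If the useful-but-tangential action’s shaped reward comes out negative like this, the agent learns to avoid it even when the end goal needs it.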
Your “social dynamics” section encouraged me to be bolder about sharing my own ideas on this forum, and I wrote up some stuff today that I’ll post soon, so thank you for that!