ARC’s hope is to train AI systems with the loss function “How much do we like the actions proposed by this system?”, in order to produce AI systems that take actions we like.
If human overseers know everything the model knows, then this is just RLHF. However, we are concerned about cases where the model understands something that the humans do not. In that case, we hope to use ELK to elicit key information that will help us understand the consequences of the AI’s actions so that we can decide whether they are good.
(The same difficulty would arise if you were trying to train AI systems to evaluate actions and then searching against those evaluations, which is the case we discuss in the ELK report.)
What happens if this finds a way to satisfy values that the human actually has, but would not have if they had been able to do ELK on their own brain? For example, I’m pretty sure I don’t want to want some things I want, and I’m worried about s-risks from the scaled version of locking in networks of conflicting things people currently truly want but truly wouldn’t want to truly want. I’m pretty sure mine are milder than this, but some people truly want to hurt others, in ways the others don’t want, in order to get ahead, and would resist any attempt to remove that desire. Given that these people can each have their own AI amplifier, what tool can be both probabilistically verifiably trustworthy and also help both AI and human mutually discover ways to be aligned with others that neither could have discovered on their own?
I’ll be happy if AI gives people time/space/safety to figure out what they want while taking actions in the world that preserve option value.
The kind of AI alignment solution we’re working on isn’t a substitute for people deciding how they want to reflect and develop and decide what they value. The idea is that if AI is going to be part of that process, then the timing and nature of AI involvement should be decided by people rather than by “we need to deploy this AI now in order to remain competitive and accept whatever effects that has on our values.”
You could imagine AI solutions that try to replace the normal process of moral deliberation and reconciliation (rather than simply being a tool to help it), but I’ve never seen a proposal along those lines that didn’t seem really bad to me.