Yes. But the whole point of the alignment effort is to look into the future, rather than have it run us over because we weren’t certain what would happen and so didn’t bother to plan for the different things that could happen.
Yeah, I get that. But to look into the future, one must take stock of the past and present and reevaluate models that gave wrong predictions. I have yet to see this happen.
The idea that agentiness is an advantage does not predict that there will never be an improvement made in other ways.
It predicts that we’ll add agentiness to those improvements. We are busy doing that. It will prove advantageous to some degree we don’t know yet, maybe huge, maybe so tiny it’s essentially not used. But that’s only in the very near term. The same arguments will keep on applying forever, if they’re correct.
WRT your comment that we don’t have a handle on values or drives, I think that’s flat wrong. We have good models in humans and AI. My post Human preferences as RL critic values—implications for alignment lays out the human side and one model for AI. But providing goals in natural language for a language model agent is another easy route to adding a functional analogue of values.
I will continue for now to focus my alignment efforts on futures where AGI is agentic, because those seem like the dangerous ones, and I have yet to hear any plausible future in which we thoroughly stick to tool AI and don’t agentize it at some point.
Edit: Thinking about this a little more, I do see one plausible future in which we don’t agentize tool AI: one with a “pivotal act” that makes creating agentic AGI impossible, probably carried out using powerful tool AI. In that future, the key bit is human motivations, which I think of as the societal alignment problem. That problem also needs to be addressed to get alignment solutions implemented, so these two futures are addressed by the same work.