Models could incorporate long-term predictions, ensuring decisions align with future sustainability and impact goals.
Researchers could implement mechanisms by which AI models autonomously recognize misalignment, shut down harmful behaviors, or, in severe cases, even self-destruct by wiping their own network weights.
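To make the tripwire idea concrete, here is a toy PyTorch sketch. The wrapper and the weight-wiping step are my own illustrative assumptions, and `misalignment_score` is a hypothetical stub standing in for a detector that does not yet exist:

```python
import torch
import torch.nn as nn

def misalignment_score(model: nn.Module, output: torch.Tensor) -> float:
    # Hypothetical stub: the entire unsolved problem lives here.
    return 0.0

class TripwireWrapper(nn.Module):
    """Runs the wrapped model, then wipes its weights if the detector fires."""
    def __init__(self, model: nn.Module, threshold: float = 0.9):
        super().__init__()
        self.model = model
        self.threshold = threshold
        self.disabled = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.disabled:
            raise RuntimeError("model has self-destructed")
        out = self.model(x)
        if misalignment_score(self.model, out) > self.threshold:
            with torch.no_grad():
                for p in self.model.parameters():
                    p.zero_()  # "self-destruct": zero out all weights
            self.disabled = True
        return out

guarded = TripwireWrapper(nn.Linear(4, 2))
y = guarded(torch.randn(1, 4))  # passes through while the stub stays quiet
```

The wrapper itself is a few trivial lines; all of the difficulty is hidden inside the stub.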
We could build modular AI systems with distinct sub-components focusing on different objectives (e.g., the overall goal, ethical considerations, social implications). These sub-components could check one another’s outputs, flagging potential high-risk conflicts or misalignment.
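Here is a minimal sketch of that cross-checking setup in plain Python. The sub-agent names and the `assess` functions are placeholder assumptions; real evaluators for ethics or social impact are precisely what we lack:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubAgent:
    name: str
    assess: Callable[[str], float]  # returns a risk score in [0, 1]

def cross_check(action: str, agents: list[SubAgent], risk_cap: float = 0.5):
    """Every sub-agent scores the proposed action; any score over the cap flags it."""
    flags = [(a.name, s) for a in agents if (s := a.assess(action)) > risk_cap]
    return ("blocked", flags) if flags else ("approved", [])

# Stub assessors: constants, not real goal/ethics/impact evaluators.
agents = [
    SubAgent("goal",   lambda act: 0.1),
    SubAgent("ethics", lambda act: 0.7),
    SubAgent("social", lambda act: 0.3),
]
print(cross_check("deploy model update", agents))  # ('blocked', [('ethics', 0.7)])
```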
This feels like you have either misunderstood the problem or you are responding with circular logic.
The problem with alignment is that we don’t know how to do alignment.
Your proposals:
- do alignment with future sustainability
- recognize misalignment
- recognize other agents’ misalignment
I repeat: the problem is that we don’t know how to “do alignment” (or “recognize misalignment”).
*
As an analogy, imagine that someone tells you “I don’t know how to swim”, and your advice would be:
- keep your head safely above the water
- don’t drown
- when swimming together with other people at the same skill level, check on each other to make sure no one is drowning
Well, if I knew how to keep my head safely above the water and how not to drown, I wouldn’t be asking the question in the first place.