>> I’ve been trying to understand and express why I find natural language alignment … so much more promising than any other alignment techniques I’ve found.
Could it be that we humans have millennia of experience aligning our new humans (children) using this method, whereas every other method is entirely new to us and has never been applied to a general intelligence, even if it has been tested on other AI systems? Predictions of outcomes are therefore speculative.
But it still seems like there is something missing from specifying goals directly via expression through language, or even via representational manipulation. If the representations themselves do not contain any reference to motivational structure (i.e., they are “value free” representations), then the goals will not be particularly stable. Johnny knows that it’s bad to hit his friends because Mommy told him so, but he only cares because it’s Mommy who told him, and he has a rather strong psychological attachment to Mommy.
I wouldn’t say this is the method we use to align children, for the reason you point out: we can’t set the motivational valence of the goals we suggest. So I’d call that “goal suggestion”. The difference in this method is that we set the goal value of that representation directly, editing the AGI’s weights in a way we can’t with children. It would be as if, when I say “it’s bad to hit people”, I also set the weights into and through the amygdala so that the concept he represents, hitting people, is tied to a very negative reward prediction. That steers his actions away from hitting people.
By selecting a representation, then editing how it connects to a steering subsystem (like the human dopamine system), we are selecting it as a goal directly, not just suggesting it and allowing the system to set its own valence (goal/avoidance marker) for that representation, as we do with human children.
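For concreteness, here is a minimal sketch of what that direct edit might look like, assuming a toy setup where a concept representation has already been identified and a linear reward-prediction head stands in for the steering subsystem. The model, the concept index, and the helper function are all illustrative assumptions, not a claim about any actual AGI architecture:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: concept activations feed a "steering subsystem"
# (analogous to a dopamine/amygdala circuit) that outputs a scalar
# reward prediction. All names and indices here are illustrative.
N_CONCEPTS = 512
HIT_PEOPLE_IDX = 42  # assumed: index of the identified "hitting people" concept

steering = nn.Linear(N_CONCEPTS, 1, bias=False)  # reward-prediction head

def set_goal_valence(steering: nn.Linear, concept_idx: int, valence: float) -> None:
    """Directly edit the weight tying one concept to the reward prediction.

    This is the "goal selection" step: instead of suggesting the goal in
    language and letting the system assign its own valence, we write the
    valence into the weights ourselves.
    """
    with torch.no_grad():
        steering.weight[0, concept_idx] = valence

# Tie the "hitting people" concept to a strongly negative reward prediction,
# so action selection steers away from states that activate it.
set_goal_valence(steering, HIT_PEOPLE_IDX, valence=-10.0)

# Sanity check: a state that activates the concept now predicts negative reward.
state = torch.zeros(N_CONCEPTS)
state[HIT_PEOPLE_IDX] = 1.0
print(steering(state))  # ≈ -10.0
```

The point of the sketch is only the last step: the valence is set by weight edit, not learned from, or mediated by, the system’s attachment to whoever delivered the instruction.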