Thanks for the comment! I do think that, at present, the only working example we have of an agent able explicitly self-inspect its own values is in the human case, even if getting the base shards ‘right’ in the prosocial sense would likely entail that they will already be doing self-reflection. Am I misunderstanding your point here?
Thanks for the comment! I do think that, at present, the only working example we have of an agent able explicitly self-inspect its own values is in the human case, even if getting the base shards ‘right’ in the prosocial sense would likely entail that they will already be doing self-reflection. Am I misunderstanding your point here?