The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. [...] In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
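(For concreteness, one standard way to cash out "a loss function for learning human values" is a pairwise-preference reward-modeling loss of the Bradley-Terry kind, where a reward model is trained so that trajectories humans prefer get higher scores. The sketch below is only illustrative; the names `preference_loss` and `reward_model`, the toy linear model, and the feature vectors are placeholder assumptions for the example, not anything from this exchange.)

```python
import numpy as np

def preference_loss(reward_model, trajectory_pairs, preferences):
    """Bradley-Terry style loss for fitting a reward model to human
    pairwise preferences (an illustrative stand-in for "a loss function
    for learning human values").

    trajectory_pairs: list of (traj_a, traj_b) feature vectors
    preferences: 1.0 if the human preferred traj_a, 0.0 if they preferred traj_b
    """
    total = 0.0
    for (a, b), pref in zip(trajectory_pairs, preferences):
        # Model's probability that the human prefers trajectory a over b.
        p_a = 1.0 / (1.0 + np.exp(reward_model(b) - reward_model(a)))
        # Cross-entropy between that prediction and the human's label.
        total += -(pref * np.log(p_a) + (1.0 - pref) * np.log(1.0 - p_a))
    return total / len(preferences)

# Toy usage: a linear reward model over three hand-picked features.
weights = np.array([0.5, -0.2, 1.0])
reward = lambda traj: float(weights @ traj)

pairs = [(np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 0.5]))]
labels = [1.0]  # the human preferred the first trajectory in the pair
print(preference_loss(reward, pairs, labels))
```

(Note that this is purely an outer training signal; by itself it says nothing about what the trained system ends up internally caring about, which is the gap the reformulation below is pointing at.)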
You said:
In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI’s learned ontology (whether or not it has a concept for people).
I said:

Thinking about this now, I think maybe it's a question of precautions, and of what order you want to teach things in. This is very similar to the argument that you might want to make a system corrigible first, before ensuring that it has other good properties, because if you make a mistake later, a corrigible system will let you correct it.

Similarly, it seems like a sensible early goal could be to get the system to understand that the sort of thing it is trying to do, in (value) learning, is to pick up human values. Once it has understood this point correctly, it is harder for things to go wrong later on, and the system may even be able to do much of the heavy lifting for you.

Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn't seem like a very correctable approach. (I think much of this pessimism comes from mentally visualizing humans arguing about which object-level values to try to teach an AI. Even if the humans are able to agree, I do not feel especially optimistic about their choices, even if those choices are supposedly informed by neuroscience and not just by moral philosophers.)
You said:

True, but I'm also uncertain about the relative difficulty of relatively novel and exotic value-spreads like "I value doing the right thing by humans, where I'm uncertain about the referent of humans", compared to "People should have lots of resources and be able to spend them freely and wisely in pursuit of their own purposes" (the latter being values that at least I do in fact have).