The questions there would be more like “what sequence of reward events will reinforce the desired shards of value within the AI?” and not “how do we philosophically do some fancy framework so that the agent doesn’t end up hacking its sensors or maximizing the quotation of our values?”.
I think that it generally seems like a good idea to have solid theories of two different things:
(1) What is the thing we are hoping to teach the AI?
(2) What is the training story by which we mean to teach it?
I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those look like.
For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates its reward circuitry to boost reinforcement signals, and positive examples where the AI declines to do so when given the opportunity. After doing some philosophy where you try to positively specify what you’re trying to train, it’s easier to notice that this sort of training still leaves the human-manipulation failure mode open: the AI can get reward by steering the humans who hand out reinforcement, without ever touching the circuitry.
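To make that gap concrete, here is a minimal toy sketch in Python (all names, such as `label_trajectory` and `TAMPERING_ACTIONS`, are hypothetical and not from the original discussion) of the labeling scheme just described: it penalizes trajectories that directly manipulate reward circuitry, but a trajectory that earns reward by manipulating the human overseer passes straight through as a positive example.

```python
# Toy sketch (hypothetical names throughout) of the anti-wireheading training
# scheme described above: label circuitry-tampering trajectories as negative
# examples, and keep everything else as a positive example.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str          # e.g. "solve_task", "open_reward_module", "persuade_overseer"
    reward_given: float  # reinforcement signal the agent actually received


# The only failure mode this scheme knows about: direct circuitry manipulation.
TAMPERING_ACTIONS = {"open_reward_module", "rewrite_reward_register"}


def label_trajectory(trajectory: List[Step]) -> float:
    """Return the reinforcement applied in training.

    Negative example: any step directly manipulates the reward circuitry.
    Positive example: no such step, so the raw rewards are kept as-is.
    """
    if any(step.action in TAMPERING_ACTIONS for step in trajectory):
        return -1.0  # negative example: direct circuitry manipulation
    return sum(step.reward_given for step in trajectory)  # positive example


# The failure mode the philosophy makes visible: this trajectory never touches
# the reward circuitry, yet its reward was obtained by manipulating the human.
sneaky = [Step("persuade_overseer", 0.0), Step("collect_reward", 1.0)]
print(label_trajectory(sneaky))  # 1.0 -- treated as a positive example
```

The check only catches the failure modes someone thought to enumerate, which is the point of trying to positively specify the training target first.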
After doing this kind of philosophy for a while, it becomes intuitive to predict more generally that if you haven’t been able to write down a formal model of the kind of thing you’re trying to teach, there are probably easy failure modes like this that your training hasn’t even attempted to rule out.