In “Relaxed adversarial training for inner alignment,” I argued that one way of mechanistically verifying an acceptability condition might be to split a model into a value-neutral piece (its optimization procedure) and a value-laden piece (its objective).
1. That summary might be useful as a TL;DR for that post, unless the description only covers the aspects of it that matter for (the ideas you are advancing in) this post.
2. It seems like those would be hard to disentangle: the value-laden piece only cares about the things it values, so its “value-neutral piece” might be incomplete for other values (though this might depend on what you mean by “optimization procedure”).
This seems relevant to the connection between strategy stealing and objective impact.
> That summary might be useful as a TL;DR for that post, unless the description only covers the aspects of it that matter for (the ideas you are advancing in) this post.
The idea of splitting up a model into a value-neutral piece and a value-laden piece was only one of a large number of things I talked about in “Relaxed adversarial training.”
> It seems like those would be hard to disentangle: the value-laden piece only cares about the things it values, so its “value-neutral piece” might be incomplete for other values (though this might depend on what you mean by “optimization procedure”).
It’s definitely the case that some optimization procedures work better for some values than for others, though I don’t think it’s that bad. What I mean by an optimization procedure here is something like a concrete implementation of some decision procedure: “predict which action will produce the largest value according to some world model and take that action,” for example. The trick is just to avoid, as much as you can, cases where your optimization procedure systematically favors some values over others. For example, you don’t want your optimization procedure to only work for very easy-to-specify values but not for other things we might care about. Nor do you want your optimization procedure to be something like “train an RL agent and then use that,” which might produce actions that privilege simple proxies over what you really want (the “forwarding the guarantee” problem).
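For concreteness, here is a minimal sketch of that kind of decision procedure (an illustration only, with entirely hypothetical names, not code from the post): the search over actions is value-neutral, while the objective is supplied as a separate, swappable piece.

```python
# Minimal illustrative sketch of the value-neutral / value-laden split.
# The optimizer only knows how to search over actions using a world model;
# everything value-laden is passed in as a separate `utility` function.
# All names here are hypothetical.

from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def choose_action(
    state: State,
    actions: Iterable[Action],
    world_model: Callable[[State, Action], State],  # value-neutral: predicts outcomes
    utility: Callable[[State], float],              # value-laden: scores outcomes
) -> Action:
    """Predict the outcome of each action and take the one with the highest value."""
    return max(actions, key=lambda action: utility(world_model(state, action)))


# The same value-neutral procedure should work no matter which objective is
# plugged in, e.g.:
#   choose_action(s, acts, world_model, utility=easy_to_specify_proxy)
#   choose_action(s, acts, world_model, utility=what_we_actually_care_about)
```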