Is there a situation you have in mind where the two approaches would be notably different in outcome?
Can you clarify what you mean by whitelisting objects? Would we only be OK with certain things existing, or coming into existence (i.e., whitelisting an object effectively whitelists all means of getting to that object), or something else entirely?
As an example, even humans are prone (if usually by accident) to dropping or throwing objects in such a way as to make their retrieval difficult or, in some cases, effectively impossible; a non-human agent, I think, should take care to avoid making the same mistake when it isn't necessary.
I hadn’t thought of this, actually! So, part of me wants to pass this off to the utility function also caring about not imposing retrieval costs on itself, because if it isn’t aligned enough to somewhat care about the things we do, we might be out of luck. That is, whitelisting isn’t sufficient to align a wholly unaligned agent—just to make states we don’t want harder to reach. If it has values orthogonal to ours, misplaced items might be the least of our concerns. Again, I think this is a valid consideration, and I’m going to think about it more!
The solution described in the second entry there was quite complicated (certainly much more so than the Nash equilibrium), but also quite successful (including, apparently, against Nash equilibrium opponents).
Certainly more complex solutions can do better, but I imagine that the work required to formally verify an aligned system is a quadratic function of how many moving parts there are (that is, part n must play nice with all n−1 previous parts).
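Spelling out the arithmetic behind that claim (just my own gloss, nothing formal): if part $k$ has to be checked against each of the $k-1$ parts before it, the total number of pairwise checks is
$$\sum_{k=1}^{n}(k-1)=\frac{n(n-1)}{2}=O(n^2).$$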
Maybe this would allow for fewer lost opportunities while still coming fairly close to ensuring that things which are both unregistered and unrecognisable (by the agent in question) would not suffer an unfavourable transition.
My current thoughts are that a rich enough latent space should also pick up unknown objects and their shifts, but this would need testing. Also, wouldn’t it be more likely that the wrong functions are extrapolated for new objects and we end up missing out on even more opportunities?
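To make the first half of that concrete, here is a minimal sketch of the kind of latent-space check I have in mind, assuming we already have some embedding of object states; the cosine-similarity measure, the threshold, and all the names below are placeholders rather than anything from the post:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two latent vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def transition_penalty(before_emb, after_emb, whitelist, threshold=0.9):
    """Penalise an observed object transition (before -> after) unless it is
    close, in latent space, to some whitelisted transition.

    `whitelist` is a list of (before, after) latent-vector pairs; the
    embedding model, the threshold, and all names here are placeholders.
    """
    for wl_before, wl_after in whitelist:
        if (cosine_sim(before_emb, wl_before) >= threshold
                and cosine_sim(after_emb, wl_after) >= threshold):
            return 0.0  # recognisably similar to an allowed transition
    return 1.0  # unrecognised transition: charge a flat penalty
```

The failure mode you raise is exactly the worry with this kind of check: if the embedding generalises badly, novel-but-benign transitions fall below the threshold and get penalised, and we miss out on even more opportunities.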