I’m not especially familiar with all the literature involved here, so forgive me if this is somehow repetitive.
However, I was wondering whether having two lists might be preferable. First, naturally, there would be non-whitelisted objects (do not interfere with these in any way). Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved so that it becomes out of reach or illegible, and should not be moved or obstructed while in use). Third, of course, would be objects with “full permissions”, such as, potentially, the paint on the aforementioned tiles.
The main difficulty here is that definitions for functional integrity would have to be either written or learned for virtually every function, though I suspect it would be (relatively) easy enough to recognise novel objects and their functions thereafter. Of course, there could also be some sort of machine-readable identification added to common objects which carries information on their functions, though whether this would only refer to predefined classes (books, bicycles, vases) or also be able to contain instructions on a new function type (potentially a useful feature for new inventions and similar) is a separate question.
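To make the three tiers a little more concrete, here is a minimal sketch of how I imagine the lookup might work. Everything in it is my own assumption for illustration: the names (`Permission`, `ObjectState`, `book_integrity`, `is_allowed`), the three boolean state fields, and the hand-written rule for books. In practice the integrity checks would presumably be learned rather than written out like this.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Permission(Enum):
    NONE = "do not interfere"         # non-whitelisted (the default)
    FUNCTIONAL = "preserve function"  # may be manipulated if function survives
    FULL = "full permissions"         # may be manipulated freely


@dataclass
class ObjectState:
    kind: str
    reachable: bool = True
    legible: bool = True
    in_use: bool = False


# Hypothetical signature: (state before, state after) -> is the function intact?
IntegrityCheck = Callable[[ObjectState, ObjectState], bool]


def book_integrity(before: ObjectState, after: ObjectState) -> bool:
    # A book must not be moved or obstructed while someone is using it.
    if before.in_use:
        return False
    # Otherwise it must end up both within reach and readable.
    return after.reachable and after.legible


# Illustrative registries; the entries are assumptions, not a real taxonomy.
PERMISSIONS: Dict[str, Permission] = {
    "book": Permission.FUNCTIONAL,
    "tile_paint": Permission.FULL,
}
INTEGRITY_CHECKS: Dict[str, IntegrityCheck] = {"book": book_integrity}


def is_allowed(before: ObjectState, after: ObjectState) -> bool:
    # Unknown kinds fall back to the most conservative tier.
    permission = PERMISSIONS.get(before.kind, Permission.NONE)
    if permission is Permission.FULL:
        return True
    if permission is Permission.FUNCTIONAL:
        check = INTEGRITY_CHECKS.get(before.kind)
        # With no known integrity check, refuse rather than guess.
        return check is not None and check(before, after)
    return False
```

The point of the defaults is that anything unrecognised, or recognised but lacking an integrity definition, is treated exactly as plain whitelisting would treat it: hands off.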
> Important detail: the whitelist is only with respect to transitions between objects, not the objects themselves!
I understand the technical and semantic distinction here, but I’m not sure I understand the practical one, when it comes to actual behaviour and results. Is there a situation you have in mind where the two approaches would be notably different in outcome?
> Something I’m not sure about is whether the described dissimilarity will map up with our intuitive notions of dissimilarity. I think it’s doable, whether via my formulation or some other one.
Well, there’s also the issue that actual, living humans hold differing opinions on different sorts of transitions. There will probably never be an end to the arguments over whether graffiti is art or vandalism, for example. Dissimilarities between average human and average non-human notions should probably be expected, to some extent; they might even be beneficial, assuming alignment goes well enough otherwise.
> “Out of reach” is indexical, and it’s not clear how (and whether) to even have whitelisting penalize displacing objects. Stasis notwithstanding, many misgivings we might have about an agent being able to move objects at its leisure should go away if we can say that these movements don’t lead to non-whitelisted transitions (e.g., putting unshielded people in space would certainly lead to penalized transitions).
Good point. Though, it’s possible to imagine displacement becoming such a transition without the harm being so overt. As an example, even humans are prone (if usually by accident) to dropping or throwing objects in such a way as to make their retrieval difficult or, in some cases, effectively impossible; a non-human agent, I think, should take care to avoid making the same mistake, where not necessary.
> If we get into functional values, that’s elevating the complexity from “avoid doing unknown things to unknown objects” to “learn what to do with each object”. We aren’t trying to build an entire utility function—we’re trying to build a sturdy, conservative convex hull, and it’s OK if we miss out on some details.
My intent in bringing it up was less, “simple whitelisting is too restrictive,” and more, “maybe this would allow for fewer lost opportunities while still coming fairly close to ensuring that things which are both unregistered and unrecognisable (by the agent in question) would not suffer an unfavourable transition.”
In other words, it’s less a replacement for the concept of whitelisting and more of a possible way of limiting its potential downsides. Of course, it would need to be implemented carefully, or else the benefits of whitelisting could also easily be lost, at least in part...
> I have a heuristic that says that the more pieces a solution has, the less likely it is to work.
While true, this reminds me of the Simple Poker series. The solution described in the second entry there was quite complicated (certainly much more so than the Nash equilibrium), but also quite successful (including, apparently, against Nash equilibrium opponents).
Additional pieces can make failure more likely, but too much simplicity can preclude success.
> I think if we can get it robustly recognizing objects in its model and then projecting them into latent space, that would suffice.
True, though there are many cases in which this doesn’t work so well. For a more practical and serious example, a fair number of people need to wear alert tags of some sort which identify certain medical conditions or sensitivities, or else they could be inadvertently killed by paramedics or ER treatment. Road signs and various sorts of notices also exist to fulfill similar purposes for humans.
While it would be more than possible to have a non-human agent read the information in such cases, written text is a form of information transmission designed and optimised for human visual processing. It comes with numerous drawbacks, including a distinct possibility that the written information is not noticed at all. These are things a machine-specific form of ‘tagging’ could likely bypass with ease.
It’s hardly the first-line solution, of course.
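For what it’s worth, here is a rough sketch of what reading such a tag might look like, covering both options I mentioned: a tag that merely names a predefined class, and a tag that carries its own constraints for a new kind of object. The JSON layout, the field names (`class`, `constraints`), the `KNOWN_CLASSES` set, and the `read_tag` function are all assumptions of mine, not any existing standard.

```python
import json
from typing import Optional

# Hypothetical set of classes the agent already has function definitions for.
KNOWN_CLASSES = {"book", "bicycle", "vase"}


def read_tag(payload: str) -> Optional[dict]:
    """Interpret a machine-readable tag, or return None if it cannot be trusted."""
    try:
        tag = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(tag, dict):
        return None

    # Option 1: the tag refers to a predefined class.
    cls = tag.get("class")
    if isinstance(cls, str) and cls in KNOWN_CLASSES:
        return {"source": "predefined", "class": cls}

    # Option 2: the tag describes a new function type inline.
    constraints = tag.get("constraints")
    if isinstance(constraints, list) and all(isinstance(c, str) for c in constraints):
        return {"source": "inline", "constraints": constraints}

    # Anything else is treated as unreadable, i.e. as an untagged object.
    return None
```

An unreadable or unrecognised tag returns `None`, so the object would simply fall back to whatever the agent does with untagged objects.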