I’m not especially familiar with all the literature involved here, so forgive me if this is somehow repetitive.
However, I was wondering if having two lists might be preferable. First, naturally, there would be non-whitelisted objects (do not interfere with these in any way). Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved in such a way that it becomes out of reach or illegible, and should not be moved or obstructed while in use). Third, of course, would be objects with “full permissions”, such as, potentially, the paint on the aforementioned tiles.
The main difficulty here is that definitions for functional integrity would have to be either written or learned for virtually every function, though I suspect it would be (relatively) easy enough to recognise novel objects and their functions thereafter. Of course, there could also be some sort of machine-readable identification added to common objects which carries information on their functions, though whether this would only refer to predefined classes (books, bicycles, vases) or also be able to contain instructions on a new function type (potentially a useful feature for new inventions and similar) is a separate question.
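To make that a bit more concrete, here is a very rough sketch of the sort of structure I have in mind; every name and field below is purely illustrative, my own stand-in rather than anything from the post or any existing scheme:

```python
from dataclasses import dataclass, field
from enum import Enum

# Purely illustrative; tier names, tag fields, and the check below are stand-ins.

class Tier(Enum):
    NO_INTERFERENCE = 0   # not whitelisted: do not interfere in any way
    FUNCTIONAL = 1        # fine to manipulate, but its functions must stay intact
    FULL = 2              # "full permissions", e.g. the paint on the tiles

@dataclass
class ObjectTag:
    """Hypothetical machine-readable tag attached to a common object."""
    object_class: str                                  # e.g. "book", "bicycle", "vase"
    tier: Tier
    functions: list = field(default_factory=list)      # e.g. ["readable", "reachable"]

def manipulation_allowed(tag, preserved_functions):
    """Would a proposed manipulation be permissible under this two-list scheme?

    `preserved_functions` is the set of the object's functions the agent predicts
    would survive the manipulation.
    """
    if tag.tier is Tier.NO_INTERFERENCE:
        return False
    if tag.tier is Tier.FULL:
        return True
    # FUNCTIONAL: allowed only if every declared function survives the manipulation
    return all(f in preserved_functions for f in tag.functions)
```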
Hey, thanks for the ideas!
Important detail: the whitelist is only with respect to transitions between objects, not the objects themselves!
> Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved in such a way that it becomes out of reach or illegible, and should not be moved or obstructed while in use).
“Out of reach” is indexical, and it’s not clear how (and whether) to even have whitelisting penalize displacing objects. Stasis notwithstanding, many misgivings we might have about an agent being able to move objects at its leisure should go away if we can say that these movements don’t lead to non-whitelisted transitions (e.g., putting unshielded people in space would certainly lead to penalized transitions).
> Third, of course, would be objects with “full permissions”, such as, potentially, the paint on the aforementioned tiles.
I think that latent space whitelisting actually captures this kind of permissions-based granularity. As I imagine it, a descriptive latent space would act as an approximation to thingspace. Something I’m not sure about is whether the described dissimilarity will match up with our intuitive notions of dissimilarity. I think it’s doable, whether via my formulation or some other one.
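As a gesture at what I mean (nothing more than a sketch; the encoder, the Euclidean distance, and the min are all stand-ins rather than a worked-out proposal), the penalty might look something like:

```python
import numpy as np

# Minimal sketch, assuming we already have some encoder `embed` that maps an
# observed object into the descriptive latent space.

def transition_embedding(before, after, embed):
    """Represent an object-level transition before -> after as one latent vector."""
    return np.concatenate([embed(before), embed(after)])

def transition_penalty(before, after, whitelist, embed):
    """Penalize a transition by its dissimilarity to the nearest whitelisted one.

    `whitelist` is a list of (before, after) object pairs we consider acceptable;
    a transition close (in latent space) to some whitelisted transition incurs
    little penalty, while one far from all of them incurs a large penalty.
    """
    observed = transition_embedding(before, after, embed)
    distances = [np.linalg.norm(observed - transition_embedding(wb, wa, embed))
                 for (wb, wa) in whitelist]
    return min(distances) if distances else float("inf")
```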
> The main difficulty here is that definitions for functional integrity would have to be either written or learned for virtually every function, though I suspect it would be (relatively) easy enough to recognise novel objects and their functions thereafter.
One of the roles I see whitelisting attempting to fill is that of tracing a conservative convex hull inside the outcome space, locking us out of some good possibilities but (hopefully) many more bad ones. If we get into functional values, that’s elevating the complexity from “avoid doing unknown things to unknown objects” to “learn what to do with each object”. We aren’t trying to build an entire utility function—we’re trying to build a sturdy, conservative convex hull, and it’s OK if we miss out on some details. I have a heuristic that says that the more pieces a solution has, the less likely it is to work.
> Of course, there could also be some sort of machine-readable identification added to common objects which carries information on their functions, though whether this would only refer to predefined classes (books, bicycles, vases) or also be able to contain instructions on a new function type (potentially a useful feature for new inventions and similar) is a separate question.
This post’s discussion implicitly focuses on how whitelisting interacts with more advanced agents for whom we probably wouldn’t need to flag things like this. I think if we can get it robustly recognizing objects in its model and then projecting them into a latent space, that would suffice.
> Important detail: the whitelist is only with respect to transitions between objects, not the objects themselves!
I understand the technical and semantic distinction here, but I’m not sure I understand the practical one, when it comes to actual behaviour and results. Is there a situation you have in mind where the two approaches would be notably different in outcome?
> Something I’m not sure about is whether the described dissimilarity will match up with our intuitive notions of dissimilarity. I think it’s doable, whether via my formulation or some other one.
Well, there’s also the issue that actual, living humans hold differing opinions about different sorts of transitions. There will probably never be an end to the arguments over whether graffiti is art or vandalism, for example. Dissimilarities between average human and average non-human notions should probably be expected, to some extent; perhaps they’d even be beneficial, assuming alignment goes well enough otherwise.
> “Out of reach” is indexical, and it’s not clear how (and whether) to even have whitelisting penalize displacing objects. Stasis notwithstanding, many misgivings we might have about an agent being able to move objects at its leisure should go away if we can say that these movements don’t lead to non-whitelisted transitions (e.g., putting unshielded people in space would certainly lead to penalized transitions).
Good point. Though, it’s possible to imagine displacement becoming such a transition without the harm being so overt. As an example, even humans are prone (if usually by accident) to dropping or throwing objects in such a way as to make their retrieval difficult or, in some cases, effectively impossible; a non-human agent, I think, should take care to avoid making the same mistake where it isn’t necessary.
> If we get into functional values, that’s elevating the complexity from “avoid doing unknown things to unknown objects” to “learn what to do with each object”. We aren’t trying to build an entire utility function—we’re trying to build a sturdy, conservative convex hull, and it’s OK if we miss out on some details.
My intent in bringing it up was less, “simple whitelisting is too restrictive,” and more, “maybe this would allow for fewer lost opportunities while still coming fairly close to ensuring that things which are both unregistered and unrecognisable (by the agent in question) would not suffer an unfavourable transition.”
In other words, it’s less a replacement for the concept of whitelisting and more of a possible way of limiting its potential downsides. Of course, it would need to be implemented carefully, or else the benefits of whitelisting could also easily be lost, at least in part...
> I have a heuristic that says that the more pieces a solution has, the less likely it is to work.
While true, this reminds me of the Simple Poker series. The solution described in the second entry there was quite complicated (certainly much more so than the Nash equilibrium), but also quite successful (including, apparently, against Nash equilibrium opponents).
Additional pieces can make failure more likely, but too much simplicity can preclude success.
> I think if we can get it robustly recognizing objects in its model and then projecting them into a latent space, that would suffice.
True, though there are many cases in which this doesn’t work so well. As a practical and serious example, a fair number of people need to wear alert tags of some sort which identify certain medical conditions or sensitivities, or else they could be inadvertently killed by paramedics or ER treatment. Road signs and various sorts of notices also exist to fulfill similar purposes for humans.
While it would be more than possible to have a non-human agent able to read the information in such cases, written text is a form of information transmission designed and optimised for human visual processing, and it comes with numerous drawbacks, including a distinct possibility that the written information is not noticed at all. These are things a machine-specific form of ‘tagging’ could likely bypass with ease. It’s hardly the first-line solution, of course.
> Is there a situation you have in mind where the two approaches would be notably different in outcome?
Can you clarify what you mean by whitelisting objects? Would we only be OK with certain things existing, or coming into existence (i.e., whitelisting an object effectively whitelists all means of getting to that object), or something else entirely?
> As an example, even humans are prone (if usually by accident) to dropping or throwing objects in such a way as to make their retrieval difficult or, in some cases, effectively impossible; a non-human agent, I think, should take care to avoid making the same mistake where it isn’t necessary.
I hadn’t thought of this, actually! So, part of me wants to pass this off to the utility function also caring about not imposing retrieval costs on itself, because if it isn’t aligned enough to somewhat care about the things we do, we might be out of luck. That is, whitelisting isn’t sufficient to align a wholly unaligned agent—just to make states we don’t want harder to reach. If it has values orthogonal to ours, misplaced items might be the least of our concerns. Again, I think this is a valid consideration, and I’m going to think about it more!
> The solution described in the second entry there was quite complicated (certainly much more so than the Nash equilibrium), but also quite successful (including, apparently, against Nash equilibrium opponents).
Certainly more complex solutions can do better, but I imagine that the work required to formally verify an aligned system is a quadratic function of how many moving parts there are (that is, part n must play nice with all n−1 previous parts).
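Roughly: if part $n$ has to be checked against each of the $n-1$ parts already verified, the total number of pairwise checks is $\sum_{n=1}^{N}(n-1) = \frac{N(N-1)}{2}$, which grows quadratically in $N$.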
> maybe this would allow for fewer lost opportunities while still coming fairly close to ensuring that things which are both unregistered and unrecognisable (by the agent in question) would not suffer an unfavourable transition.
My current thoughts are that a rich enough latent space should also pick up unknown objects and their shifts, but this would need testing. Also, wouldn’t it be more likely that the wrong functions are extrapolated for new objects and we end up missing out on even more opportunities?