I agree preferences aren’t reducible to actual behavior. But I think they are reducible to dispositions to behave, i.e., behavior across counterfactual worlds. If a system prefers a specific event Z, that means that, across counterfactual environments you could have put it in, the future would on average have had more Z the more its specific distinguishing features had a large and direct causal impact on the world.
The examples I used seem to apply to “dispositions” to behave in the same way (I wasn’t making this distinction). There are settings where the goal can’t be clearly inferred from behavior, or from a collection of hypothetical behaviors in response to various environments, at least if we keep the environments relatively close to what might naturally occur, even though in those settings the goal can be observed “directly” (defined as an idealization based on the AI’s design).
An AI with an encrypted goal (i.e., the AI itself doesn’t know the goal in explicit form, but the goal can be abstractly defined as the result of decryption) won’t behave in accordance with that goal in any environment that doesn’t magically let it decrypt the goal quickly. There is no tendency to push events towards what the encrypted goal specifies until the goal is decrypted (which, with high probability, might be never).
I don’t think a sufficiently well-encrypted ‘preference’ should be counted as a preference for present purposes. In principle, you can treat any physical chunk of matter as an ‘encrypted preference’, because if the AI just were a key of exactly the right shape, then it could physically interact with the lock in question to acquire a new optimization target. But if neither the AI nor anything very similar to the AI in nearby possible worlds actually acts as a key of the requisite sort, then we should treat the parts of the world that a distant AI could interact with to acquire a preference as, in our world, mere window dressing.
Perhaps if we actually built a bunch of AIs, and one of them was just like the others except that, where the others of its kind had a preference module, it had a copy of The Wind in the Willows, we would speak of this new AI as having an ‘encrypted preference’ consisting of a book, with no easy way to treat that book as a decision criterion like its brother- and sister-AIs do for their homologous components. But I don’t see any reason right now to make our real-world usage of the word ‘preference’ correspond to that possible world’s usage. It’s too many levels of abstraction away from what we should be worried about, which is the actual real-world effects different AI architectures would have.