Will blackboxing the reward function help, either physically or cryptographically? Such blackboxing should also obscure the boundary between the black box and the AI's internal computations, so that the AI cannot tell which data actually trigger the black box's reaction.
This seems to be how the human reward function works. It is well protected from internal hacking: if I merely imagine receiving 100 USD, it does not produce as much pleasure as actually receiving 100 USD. When I send a mental image of 100 USD into my reward box, the box "knows" that I am lying and does not generate the reward. Since I don't know much about how the real human reward function works, I have to get a real 100 USD.
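A minimal sketch of what the cryptographic version of this might look like, assuming the environment holds a secret key the agent never sees: genuine observations arrive signed, and the reward box pays out only for observations whose signature verifies, so internally fabricated "imagined" inputs earn nothing. All names here are hypothetical illustrations, not a worked-out proposal.

```python
import hmac
import hashlib

# Assumption: this key lives in the environment/black box, never in the agent.
SECRET_KEY = b"held-by-environment-not-agent"

def sign_observation(observation: bytes) -> bytes:
    """Run by the environment: attach a MAC to a genuine observation."""
    return hmac.new(SECRET_KEY, observation, hashlib.sha256).digest()

def reward_box(observation: bytes, tag: bytes) -> float:
    """The black box: reward only if the observation's tag verifies."""
    expected = hmac.new(SECRET_KEY, observation, hashlib.sha256).digest()
    if hmac.compare_digest(expected, tag):
        return 1.0   # genuine event, e.g. actually receiving 100 USD
    return 0.0       # forged/imagined input: the box "knows" it is a lie

# Genuine event, signed by the environment: reward granted.
real = b"received 100 USD"
print(reward_box(real, sign_observation(real)))   # 1.0

# Imagined event: without the key the agent cannot forge a valid tag.
print(reward_box(real, b"\x00" * 32))             # 0.0
```

On this picture, the "obscure boundary" requirement is the demand that the agent cannot even locate where verification happens, let alone extract the key.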