Content note: I have a habit of writing English in a stream-of-consciousness way that sounds more authoritative than it should, and I don’t care to try to remove that right now; please interpret this as me thinking out loud.
I think it’s instructive to compare it to the YouTube recommender, which is trying to manipulate you, and whose algorithm is publicly unknown (though, for response-latency reasons, it must still be similar in some important ways to what it was a few years ago when they published a paper about it). In general, an intelligent agent even well above your capability level is not guaranteed to succeed at manipulating you, and I don’t see reason to believe that the available paths for manipulating someone will be significantly different for an AI than for a human, unless the AI is many orders of magnitude smarter than the human (which does not look likely to happen soon). Could Elicit manipulate you? Yeah, for sure it could. Should you trust it not to? Nope. But because its output is grounded in concrete papers and discussion, its willingness to manipulate isn’t the only factor. Detecting bad behavior would still be difficult, but the usual process of mentally modeling what incentives might turn an actor into a bad actor doesn’t seem at all hopeless against powerful AIs to me. The techniques used by human abusers are in the training data, would be the first things to activate if the model tried to manipulate, and the process of recognizing them is known.
Ultimately, to reliably detect manipulation you need a model of the territory about as good as the one you’re querying. That’s not always available, and overpowered search of the kind used in AlphaZero and its successors is likely to break it, but right now most AI deployments don’t use that level of search, likely because capabilities practitioners know well that reward model hacking is likely if something like EfficientZero is aimed at real life.
Maybe this thinking-out-loud is useless. Not sure. My mental sampling temperature is too high for this site, idk. Thoughts?