If you can give them a fun problem to solve and make sure it’s actually relevant and they are only rewarded for actually relevant work, then good research could still be produced.
Yeah I think the difficulty of setting this up correctly is the main crux. I’m quite uncertain on this, but I’ll give the argument my model of John Wentworth makes against this:
The Trojan detection competition does seem roughly analogous to deception, and if you can find Trojans really well, it’s plausible that you can find deceptive alignment. However, what we really need is a way to exert optimization pressure away from deceptive regions of parameter space. And right now, afaik, we have no idea how strongly deception is favored.
I can imagine using methods from this competition to put a small amount of pressure away from deception, e.g. by restarting whenever you see deception, or by running SGD against your interpreted deception signal. But this feels sketchy because 1) you are putting optimization pressure on these tools, and you might just steer into regions of parameter space where they fail, and 2) you are training a model until it becomes deceptive: eventually, a smart deceptive model will be actively trying to beat these tools.
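To make those two options concrete, here is a minimal PyTorch-style sketch (my own illustration, not anything from the competition): the `deception_score` function is a hypothetical stand-in for whatever interpretability tool you trust, and the loop shows both restarting when it fires and folding its score into the loss for SGD.

```python
# Sketch only: toy model, toy data, and a placeholder "deception detector".
import torch
import torch.nn as nn

model = nn.Linear(16, 2)            # toy model standing in for the real network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(256, 16), torch.randint(0, 2, (256,))

def deception_score(model, x):
    # Placeholder: a real tool would inspect internals/behavior and return a
    # differentiable "how deceptive does this look" scalar.
    return model(x).abs().mean() * 0.01

for step in range(100):
    task_loss = nn.functional.cross_entropy(model(x), y)
    d_score = deception_score(model, x)

    # Option 2: run SGD against the interpreted deception signal by adding it
    # to the loss -- exactly the move that risks steering into regions where
    # the tool silently fails.
    loss = task_loss + d_score
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Option 1: restart (reinitialize) whenever the detector fires above some
    # threshold, instead of differentiating through it.
    if d_score.item() > 0.05:
        model = nn.Linear(16, 2)
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
```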
So what I really want is an understanding of the generators of deceptive alignment, which could take the form of a formal version of the argument given here, so that I can avoid entering deceptive regions of parameter space in the first place.
The ML community has produced relevant research before, from people who weren’t explicitly thinking about x-risk (mostly “accidentally”, i.e. not because anyone who cared about x-risk told them to or incentivized them, but hopefully this will change).
Could you link an example? I am curious what you have in mind. I’m guessing something like the ROME paper?
Thoughts on John’s comment: this is a problem with any method for detecting deception that isn’t 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.
Also, you can somewhat get around this by holding some deception-detection methods out (i.e. not optimizing against them). If, when you finish training, the held-out methods tell you that your AI is deceptive, you start over. Then you have to think of another approach that is more likely to actually discourage deception than to fool your held-out detectors. This is the difference between gradient-descent search and human design search, which I think is an important distinction.
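Here is a rough sketch of what I mean by holding detectors out, with placeholder detector functions and a made-up threshold: some detectors go into the training loss, one is reserved purely for a final check, and if the held-out one fires you change the approach rather than optimizing against it too.

```python
# Sketch only: all detectors here are hypothetical placeholders.
import torch
import torch.nn as nn

def det_a(model, x):
    # In-loop detector: this one IS optimized against during training.
    return model(x).abs().mean() * 0.01

def det_heldout(model, x):
    # Held-out detector: never appears in the loss, only checked at the end.
    with torch.no_grad():
        return model(x).var()

def train_once(trained_against, x, y, steps=200):
    model = nn.Linear(16, 2)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        for det in trained_against:
            loss = loss + det(model, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

x, y = torch.randn(256, 16), torch.randint(0, 2, (256,))
model = train_once([det_a], x, y)

# Final check with the held-out detector only. If it fires, go back to the
# drawing board on the training approach; don't start doing gradient descent
# against this detector as well.
if det_heldout(model, x).item() > 1.0:
    print("Held-out detector fired: rethink the approach.")
```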
Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute force optimization. If you ported this over to sequential-decision-making land where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.
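To illustrate why the current setting is easy, here is a simplified trigger-reversal sketch in the spirit of Neural Cleanse (toy model and data, not any particular competition entry): you just optimize a candidate input-space trigger to flip the model to a suspected target class, with a sparsity penalty so you recover a small trigger rather than an arbitrary perturbation.

```python
# Sketch only: brute-force search for a fixed input-space trojan trigger.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).eval()
x = torch.randn(128, 16)                        # clean inputs
target_class = torch.zeros(128, dtype=torch.long)

trigger = torch.zeros(16, requires_grad=True)   # candidate trigger pattern
opt = torch.optim.Adam([trigger], lr=0.1)

for _ in range(300):
    logits = model(x + trigger)                 # apply candidate trigger everywhere
    # Push all inputs toward the suspected target class; the L1 penalty keeps
    # the recovered trigger small/sparse.
    loss = (nn.functional.cross_entropy(logits, target_class)
            + 0.01 * trigger.abs().sum())
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the loss gets low with a small trigger, the model likely has a trojan for
# that class. In a sequential-decision setting the "trigger" could be a long,
# structured history, and this kind of direct search stops working.
```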
In general, it seems better to me to evaluate research by asking “where is this taking the field/what follow-up research is this motivating?” rather than “how are the words in this paper directly useful if we had to build AGI right now?” Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I’m pretty skeptical of a lot of the direct value of empirical research.