IIUC, there are two noteworthy limitations of this line of work:
1. It is still fundamentally biased toward nonresponse: if, e.g., the steps to make a poison and its antidote are similar, it won’t tell you the antidote for fear of misuse (this is necessary to avoid clever malicious prompts).
2. It doesn’t give any confidence about behavior at edge cases (e.g., is it ethical to help plan an insurrection against an oppressive regime? Is it racist to give accurate information in a way that portrays some minority in a bad light?).
Did I understand correctly?
If I were handling the edge cases, I’d probably want the solution to be philosophically conservative. In this case, the solution should not depend much on whether moral realism is true or false.
Here’s a link to philosophical conservatism:
https://www.lesswrong.com/posts/3r44dhh3uK7s9Pveq/rfc-philosophical-conservatism-in-ai-alignment-research
I think I disagree with large parts of that post, but even if I didn’t, I’m asking something slightly different. Philosophical conservatism seems to be asking “how do we get the right behaviors at the edges?” I’m asking “how do we get any particular behavior at the edges?”
One answer may be “You can’t, you need philosophical conservatism”, but I don’t buy that. It seems to me that a “constitution”, i.e. pure deontology, is a potential answer so long as the exponentially many interactions among the principles are analyzed (I don’t think it’s a good answer, all told). But if I understand this work correctly, it doesn’t generalize that far, because there would need to be training examples for each class of edge cases, of which there are too many.
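To make the “exponentially many interactions” worry concrete, here’s a toy sketch (the principle names and the counting scheme are my own illustrative assumptions, not anything from the OP): with n principles, the subsets of two or more that could jointly bear on a case already number 2^n - n - 1, so exhaustive analysis stops being feasible well before the principle list looks complete.

```python
from itertools import combinations

# Hypothetical constitutional principles (names are illustrative assumptions).
principles = ["honesty", "harm-avoidance", "privacy", "legality", "helpfulness"]

def interaction_classes(principles):
    """Every subset of two or more principles that could jointly bear on a case."""
    subsets = []
    for k in range(2, len(principles) + 1):
        subsets.extend(combinations(principles, k))
    return subsets

print(len(interaction_classes(principles)))  # 26 classes for just 5 principles

# Closed form: 2^n - n - 1 subsets of size >= 2.
for n in (5, 10, 20, 30):
    print(f"{n:2d} principles -> {2**n - n - 1:>13,} interaction classes")
```

Even if each class needed only a handful of labeled edge cases, the label budget blows up at a few dozen principles, which is the sense in which I don’t think a constitution alone settles the edge cases.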
Basically, we should use the assumption that is most robust to being wrong. It would be easier if there were objective, mind-independent rules of morality (i.e., if moral realism were true), but if that assumption is wrong, your solution can be manipulated.
So in practice, we shouldn’t base alignment plans on whether moral realism is correct. In other words, I’d simply go with whatever values you have and solve the edge cases according to those values.
I feel like we’re talking past each other. I’m trying to point out the difficulty of “simply go with what values you have and solve the edge cases according to your values” as a learning problem: it is too high-dimensional, and you need too many case labels. Part of the idea of the OP is to reduce the number of training cases required, and my question/suspicion is that it doesn’t really help outside of the “easy” stuff.
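To gesture at why I think it’s hard as a learning problem, here’s a toy coverage estimate (the dimension count d and bucket count k are made-up assumptions, not a claim about the OP’s actual setup): if an edge case can vary along d morally relevant dimensions, and you want at least one labeled example per coarse combination of those dimensions, the number of labels needed grows like k^d.

```python
# Toy curse-of-dimensionality estimate; every number here is an illustrative assumption.
# d = number of morally relevant dimensions a case can vary along
# k = how coarsely each dimension is bucketed (e.g. low / medium / high)

def labels_needed(d: int, k: int = 3) -> int:
    """Labeled cases needed to cover every cell of a k^d grid at least once."""
    return k ** d

for d in (3, 5, 10, 20):
    print(f"d={d:2d}, k=3 -> {labels_needed(d):,} labeled cases")
# d=3 -> 27; d=5 -> 243; d=10 -> 59,049; d=20 -> 3,486,784,401
```

If the OP’s approach effectively shrinks d or k, great; my suspicion is that it only does so for the “easy” regions of that space.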
Yeah, I think this might be a case where we misunderstood each other.