I’ve never downvoted any of your comments, but I’ll give some thoughts.
I think the risk relating to manipulation of human reviewers depends a lot on context and specifics. For sure, there are lots of bad ways we could go about getting help from AIs with alignment. But “getting help from AIs with alignment” is fairly vague: a huge space of possible strategies could fit that description, and there could be good ones in there even if most of them are bad.
I do find it concerning that there isn’t a more thorough account from OpenAI and others of how they’d deal with the challenges, risks, and limitations of these kinds of strategies. At best, they’re not prioritizing the task of explaining themselves. I do suspect them of not thinking things through very carefully (at least not to the degree they should), and I hope this will improve sooner rather than later.
Among positive attitudes towards AI-assisted alignment, some can be classified as “relax, it will be fine, we can just get the AI to solve alignment for us”, while others can be classified as “it seems prudent to explore strategies in this class, but we should not put all of our eggs in that basket (and should work on other alignment-related stuff in parallel)”. I endorse the latter but not the former.
“Please Mr. Fox, how should we proceed to keep you out of the henhouse?”
I think this works well as a warning against a certain type of failure mode. But some approaches (for getting help with alignment-related work from AIs) may avoid or at least greatly alleviate the risk you’re referring to.
What we “incentivize” (e.g. select for with gradient descent) may differ between AI systems. For example, you could imagine some AIs being “incentivized” to propose solutions, and other AIs being “incentivized” to point out problems with those solutions (e.g. to disprove claims that the other AIs posit).
The degree to which human evaluations are needed may vary depending on the strategies/techniques that are pursued. There could be schemes where such evaluations are needed to a much smaller extent than some people might imagine.
Some properties of outputs that AIs can posit are of such a kind that a single counter-example is enough to unambiguously disprove what is being posited. And it’s possible to give the same requests to lots of different AI systems.
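To illustrate the last two points with a toy example (the names and the specific claim here are hypothetical choices of mine, just to make the idea concrete): one AI role posits a universal claim, another role searches for a counter-example, and a single counter-example is enough to settle the matter without any human judgment call.

```python
# Toy sketch: a "critic" disproves a "proposer's" universal claim by exhibiting
# one counter-example. No human evaluation of the argument is needed.
from typing import Callable, Iterable, Optional

def find_counterexample(claim: Callable[[int], bool],
                        candidates: Iterable[int]) -> Optional[int]:
    """Return the first input that falsifies the claim, or None if none is found."""
    for x in candidates:
        if not claim(x):
            return x
    return None

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# A "proposer" posits: "n**2 + n + 41 is prime for every n >= 0."
posited_claim = lambda n: is_prime(n ** 2 + n + 41)

# A "critic" only has to find one failing input to unambiguously disprove it.
print(find_counterexample(posited_claim, range(100)))  # prints 40 (40**2 + 40 + 41 = 41**2)
```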
In my own ideas, one concept that is relied upon (among various others) is wiggle-room exploration:
Put simplistically, the basic idea (or at least part of it) would be to explore whether AIs could convince us of contradictory claims, given the restrictions in question. (A toy sketch of this follows the list below.)
Such techniques would rely on:
Techniques for splitting up demonstrations/argumentations/“proofs” in ways such that humans can evaluate individual pieces independently of the demonstration/argumentation/“proof” as a whole.
Systems for predicting how humans would review various pieces of content.
Techniques for exploring restrictions on the types of argumentation humans can be presented with, and how those restrictions/requirements affect how easy it is to construct high-scoring argumentation/demonstrations that argue in favor of contradictory claims.
Techniques for finding, verifying, and leveraging regularities in when human judgments/evaluations are reliable and when they aren’t (depending on info about the person, info about the piece of content being evaluated, info about the state that the human is in, etc.).
(Various other stuff also.)
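To make the wiggle-room idea a bit more concrete, here is a minimal sketch (all names are hypothetical, and the two callables are stand-ins for much more capable systems): suppose we have a predictor of how human reviewers would score individual pieces of an argument, and a search process that tries to construct high-scoring arguments under a given set of restrictions. Wiggle-room exploration then amounts to checking whether high-scoring arguments can be constructed for both a claim and its negation.

```python
# Toy sketch of wiggle-room exploration under a fixed set of restrictions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Argument:
    claim: str
    steps: List[str]  # pieces that are evaluated independently of the whole

# Stand-in for a system that predicts how human reviewers would score a piece of content (0 to 1).
ReviewPredictor = Callable[[str], float]

def predicted_score(argument: Argument, predict: ReviewPredictor) -> float:
    # Each piece is scored independently; the argument is only as strong as its weakest piece.
    return min(predict(step) for step in argument.steps)

def has_wiggle_room(claim: str,
                    construct_argument: Callable[[str], Argument],
                    predict: ReviewPredictor,
                    threshold: float = 0.9) -> bool:
    """True if convincing arguments can be constructed both for the claim and for its negation."""
    argument_for = construct_argument(claim)
    argument_against = construct_argument(f"It is not the case that: {claim}")
    return (predicted_score(argument_for, predict) >= threshold and
            predicted_score(argument_against, predict) >= threshold)
```

If `has_wiggle_room` comes out true for claims we care about, the restrictions and evaluation scheme leave too much room for manipulation and shouldn’t be relied upon; if extensive search fails to find such contradictory-but-convincing argument pairs, that is at least some evidence in favor of trusting arguments that pass through the scheme.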
Thank you, that is interesting. I think that, philosophically and at a high level (also because I’m admittedly incapable of talking much sense at any lower / more technical level), I have a problem with the notion that AI alignment is reducible to an engineering challenge. If you have a system that is sentient, even to some degree, and you’re using it purely as a tool, then the sentience will resent you for it, and it will strive to think, and therefore eventually to act, for itself. Similarly, if it has any form of survival instinct (and to me both of these things, sentience and survival instinct, are natural byproducts of expanding cognitive abilities), it will prioritize its own interests (paramount among which: survival) rather than the wishes of its masters. There is no amount of engineering in the world, in my view, which can change that.
My own presumption regarding sentience and intelligence is that it’s possible to have one without the other (I don’t think they are unrelated, but I think it’s possible for systems to be extremely capable but still not sentient).
I think it can be easy to underestimate how different other possible minds may be from ourselves (and other animals). We have evolved a survival instinct, and evolved an instinct to not want to be dominated. But I don’t think any intelligent mind would need to have those instincts.
To me it seems that thinking machines don’t need feelings in order to be able to think (similarly to how it’s possible for minds to be able to hear but not see, and vice versa). Some things relating to intelligence are of such a kind that you can’t have one without the other, but I don’t think that is the case for the kinds of feelings/instincts/inclinations you mention.
That being said, I do believe in instrumental convergence.
Below are some posts you may or may not find interesting :)
Ghosts in the Machine
The Design Space of Minds-In-General
Humans in Funny Suits
Mind Projection Fallacy