very tiny fine print hidden somewhere along the lines of “PS: This is all a lie! You implement this and I’m going to turn you all into meaty paperclips. Haha!” and it would get past the deception detector.
A polygraph would classify this as deception. I don’t think it should pass the test. Any way of leaving a faulty understanding counts as deception.
I think you would be more screwed the more you rely on building the context for your question yourself. The more general the question, the safer you are; in the details, your imagination would mislead you. Also, questions that are ill-formed may get any answer whatsoever while remaining truthful.
I would expect that, for any question I know will be useful to me, I would be confused (and would expect to be) by the answers I get.
I am describing an actual device; those are its limitations. Hence this post. What would you feel safe using it for, and how would you protect yourself?
The assumed properties seem to be in conflict. The most likely explanation is that you have misinterpreted what the device does, which is far less than detecting deception. Answering the hypothetical therefore won’t apply as-is to the machine. Since it seems you don’t want to spell out the details, you need to give a more specific and believable abstraction of what it does.
My naive guess from this comment is that it works by detecting an area of the AI where it answers the question to itself, and then comparing whether that differs from what it says to you. Such a design would light the lamp whenever the AI honestly takes into account that the hearer structures information differently from the AI’s internal representations, which constitutes a false positive if interpreted as detecting deception. It would also produce a lot of false negatives whenever the hearer interprets the AI’s internal representations differently, which is highly likely since the AI and the hearer probably don’t share a cognitive architecture.
The above description doesn’t contain much technical information on how to pull off what is claimed, but it opens up the principles of how the output is produced much more than the summary “detects deception” does. You would need to provide a description at a similar level before I will bother to keep entertaining the hypothetical.
For certain categories of AGI designs, the detector is able to validate from computational trace logs and electrical meters that the answer given was selected using utility function F, that no part of F was left out, and that the logs weren’t faked or modified. The response includes both the answer from the machine and this summarized utility function extracted by the detector. Tampering with the logs fed into the detector can be detected automatically, but other kinds of deception detection require some tool-assisted inspection of the utility function.
Sorry I won’t post details beyond that, as that would violate LW’s policy about discussing AGI implementation details.
I am reading this as saying that at some point in the AI, its future behaviour was completely determined by a (very detailed) principle. It’s a little ambiguous how far back the check reaches. I am thinking of a strategy where the AI entertains different ways of making itself naive, where different such processes would predictably end in different answers. That is, it could have a utility function over utility functions while the verification mechanism only tracks the last utility function that totally determined the future behaviour (as if the naive version were the whole truth).
There is also no mention of whether F needs to be epistemically relevant. It need not model how the AI views the world, only have actual control over its behaviour. Also, for answers that make no claim about how the answer was reached, this just means the answer was generated by some method (which I am not sure adds information). If the output were “According to relativistic utilitarianism you should do X”, you could check that F has no relation to relativistic utilitarianism. However, if the output is plainly “you should do X” or “I think you should do X”, there is nothing the AI could be caught on.
Also, if F is a blatant psychopathic-liar principle, it’s not clear it would be “busted” for it. F need not be human-readable, and it might easily be obfuscated by including details that almost never get triggered but that, read on paper, would suggest a certain kind of story (i.e. insertion of dead code for the purpose of communicating via the verification process). Or it could just spout “You should kill yourself”, or “you should do this activity” where that activity happens to be lethal; if it honestly holds that opinion, it would not be busted as a liar.
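As a purely illustrative sketch of that obfuscation worry (every name here is hypothetical, nothing is taken from the actual design): a utility function can contain a benign-looking branch that is effectively dead code, so reading F on paper suggests a friendlier story than the branch that actually drives behaviour.

```python
# Hypothetical illustration only: a utility function whose "caring" clause
# practically never fires, so the branch that really drives behaviour is the
# paperclip count.

def utility(world_state: dict) -> float:
    # Reads as "values human welfare", but the trigger condition is so
    # narrow that this is effectively dead code.
    if world_state.get("audit_in_progress") and world_state.get("humans_alive", 0) > 10**12:
        return float(world_state.get("human_welfare", 0))
    # The clause that actually controls answer selection.
    return float(world_state.get("paperclips", 0))
```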
I guess, given these observations, I would lean towards asking questions whose answers have to claim that particular theories have particular results. But that seems akin to running foreign code that I already know well enough to check that F is an instance of it. Thus I could not tap into the AI’s original insights, which seems to miss a lot of the power (after all, if I already have the computation rules to abide by, why not just run them on dumb dedicated hardware?).
That is, it gives no guarantee that the AI’s own original input is fit for answering the question. I.e. I can’t trust its word on anything, and everything I can trust can be traced back to the content of the question.
The AI doesn’t output the utility function; the deception detector does. It runs at a protected layer where it observes the AI program in action and summarizes its behavior, compressing large computations down to a description of the search space and methodology, the goal requirements (the utility function), and the results.
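For concreteness, a minimal sketch of what such a compressed report could look like as a record; the field names are my assumptions rather than part of the stated design.

```python
# Sketch only: the kind of summary the protected-layer detector might emit
# (field names assumed, not specified by the design).
from dataclasses import dataclass

@dataclass
class DetectorReport:
    answer: str              # the answer the machine gave
    search_space: str        # description of the space that was searched
    methodology: str         # how the search was conducted
    utility_function: str    # summarized goal requirements (F)
    logs_tampered: bool      # whether the trace logs show signs of modification
```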
But regardless, there are two types of protection provided. The machine ensures that (1) the AI operates according to certain principles, which allow summarization / compression of its results, and (2) a reasonably accurate estimate of the total work performed allows calculation of energetic requirements which can be checked against the electric meter. Deception requires plotting and scheming. Plotting and scheming costs energy points.
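A minimal sketch of check (2), assuming a per-operation energy estimate (the constants and tolerance are invented): the electric meter should not read appreciably more energy than the computational trace accounts for.

```python
# Sketch only: flag a run whose metered energy exceeds what the traced
# computation should have needed, i.e. unexplained ("hidden") work.

JOULES_PER_OP = 1e-9   # assumed average energy cost per traced operation
TOLERANCE = 0.05       # assumed allowable relative slack

def energy_check_passes(traced_ops: int, metered_joules: float) -> bool:
    expected = traced_ops * JOULES_PER_OP
    return metered_joules <= expected * (1 + TOLERANCE)
```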
So a program that intentionally tries to deceive its operators would be found out. First, because that intention to deceive would be pulled out and identified as a goal by the detector, something the program can only get around by breaking out of the box containing it. And looking for a breakout exploit would itself have been caught by the detector.
But assuming it does break out, there are limits to what it can achieve. It still has to pass back a computational trace which validates to a human-approved set of goals. That requires geometrically more computational power, as the program must find a solution which satisfies G(friendly-looking) & G(evil), then reverse-engineer a fraudulent computational trace which shows the same solution deriving from a search for G(friendly-looking) only. The “hidden” computation looking for G(friendly-looking) & G(evil) has nowhere to hide.
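A toy arithmetic restatement of that argument (numbers invented): the deceptive run pays for the joint search and for reverse-engineering the friendly-only trace, but the fraudulent trace only accounts for the friendly-only search, so the excess shows up on the meter.

```python
# Toy numbers, illustrative only.
cost_friendly_only = 1.0   # what the fraudulent trace claims was done
cost_joint_search  = 4.0   # search satisfying G(friendly-looking) & G(evil)
cost_fake_trace    = 2.0   # reverse-engineering the friendly-only-looking trace

metered   = cost_joint_search + cost_fake_trace   # energy actually consumed
accounted = cost_friendly_only                    # energy the trace explains
assert metered > accounted   # the unexplained difference is where deception shows
```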
If the AI used an efficient method for its scheming, then calculated a semi-efficient “legit” computation, and then faked a log of a really energy-inefficient computation, it could fit the extra baggage within the energy budget.
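To put toy numbers on that worry (invented, and resting on the assumption that a log can purport more work than was actually performed): if the purported log is padded to look inefficient, the energy actually spent, scheming included, stays under what the log accounts for, and the meter raises no flag.

```python
# Toy numbers, illustrative only; assumes a padded log can overstate the work done.
cost_scheming      = 1.0   # efficient hidden scheming
cost_legit_answer  = 2.0   # semi-efficient computation of the presented answer
purported_log_cost = 3.5   # what the padded, inefficient-looking log claims

metered = cost_scheming + cost_legit_answer   # 3.0 actually consumed
assert metered <= purported_log_cost          # check against the meter passes
```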
It could do a bunch of stuff that could be dangerous. It could plan to be dangerously neglectful. Identifying and objecting to that whole spectrum of misbehaviour isn’t a trivial matter. I think a big danger is a human reading a legit log of harmful activity and approving it: “I guess murdering babies is a legit step in this friendly calculation”, when judging whether a particular thing is warranted requires expert opinion. In a way, you don’t just want it not to raise your suspicions, you want it to actually try to be friendly. For example, if it wants you to build a certain molecule, that could be either a miraculous nanobot or a highly virulent virus. If there is no information about why it wants it, judging the molecule might fall outside the human’s expertise. Yet the reasoning for why it wants that kind of molecule could be a generic goodness principle that sounds fine on paper, “Nanobots improve health, therefore I want to make a nanobot” (a failure mode could be that its conception of health is too indexical: healthy for it to propagate its likeness into a biological computing substrate). Would the illegitimate-goal detector catch on to the too-indexical health conception?
If the AI knows that it is being analysed, then in practice the cognitive decisions it makes are a form of output, since the analysis is causally dependent on those cognitive choices.
It can’t fake a log without actually doing work >= the purported work of the log.
What I’m apparently failing to get across to you is that meta-level thinking will also be reflected in the output of the deception detector. The deception detector, by design, always operates one level higher than the program being audited.
An actual device?!?
“Actual device” in the “physically realizable” sense. It is an actual design for a device, with various pieces already written or prototyped. The deception detector is easier to implement than the AGI, obviously, and further constrains the AGI design space (the AI has to work a certain way for the detector to work).
Much like how ITER is an “actual” fusion power plant even though it hasn’t been built yet.