The AI doesn’t output the utility function; the deception detector does. It runs at a protected layer where it observes the AI program in action and summarizes its behavior, compressing large computations down to a description of the search space and methodology, the goal requirements (the utility function), and the results.
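For concreteness, here is a minimal sketch (Python; every class name and field below is my own illustrative invention, not something the scheme specifies) of the kind of digest the protected layer might hand to the operators:

```python
from dataclasses import dataclass

@dataclass
class ComputationSummary:
    """Record emitted by the protected-layer detector: a large computation
    compressed into an auditable digest."""
    search_space: str       # what region of solution space was explored
    methodology: str        # the search / optimization method used
    utility_function: str   # the goal requirements the search optimized for
    results: str            # the solutions actually returned
    estimated_work: float   # total work performed, in abstract compute units

# What an operator might be shown:
summary = ComputationSummary(
    search_space="candidate molecule designs",
    methodology="beam search over reaction pathways",
    utility_function="G(friendly-looking)",
    results="3 candidate designs",
    estimated_work=4.2e15,  # placeholder figure
)
print(summary)
```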
But regardless, there are two types of protection provided. The machine ensures (1) that the AI operates according to certain principles which allow summarization / compression of its results, and (2) that a reasonably accurate estimate of the total work performed can be made, from which the energy requirements are calculated and checked against the electric meter. Deception requires plotting and scheming. Plotting and scheming costs energy points.
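A toy version of check (2), with made-up constants (`JOULES_PER_WORK_UNIT` and the tolerance are my assumptions, not part of the proposal), just to show that it reduces to a single inequality against the meter reading:

```python
# Toy energy audit for protection (2): the work the summary claims must be
# consistent with what the electric meter actually recorded.
JOULES_PER_WORK_UNIT = 1e-9   # hypothetical energy cost per unit of work
TOLERANCE = 0.10              # allow 10% slack for estimation error

def energy_audit_passes(estimated_work: float, metered_joules: float) -> bool:
    expected_joules = estimated_work * JOULES_PER_WORK_UNIT
    # Extra unexplained consumption suggests unaccounted-for computation:
    # plotting and scheming costs energy points.
    return abs(metered_joules - expected_joules) <= TOLERANCE * expected_joules

print(energy_audit_passes(estimated_work=4.2e15, metered_joules=4.3e6))  # True
print(energy_audit_passes(estimated_work=4.2e15, metered_joules=6.0e6))  # False
```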
So a program that intentionally tries to deceive its operators would be found out. First, because that intention to deceive would be pulled out and identified as a goal by the detector, something which the program can only get around by breaking out of the box containing it. And looking for a breakout exploit would itself be caught by the detector.
But assuming it does break out, there are limits to what it can achieve. It still has to pass back a computational trace that validates against a human-approved set of goals. That requires geometrically more computational power, since the program must find a solution which satisfies G(friendly-looking) & G(evil), then reverse-engineer a fraudulent computational trace which shows the same solution deriving from a search for G(friendly-looking) only. The “hidden” computation looking for G(friendly-looking) & G(evil) has nowhere to hide.
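To spell the claim out, here is a toy cost model (purely illustrative; the blowup factor and the forging cost are assumptions standing in for “geometrically more”) of why the deceptive strategy can’t stay inside an honest-looking energy budget:

```python
# Toy cost model for the deceptive strategy described above, assuming the
# joint search for G(friendly-looking) & G(evil) is some factor harder than
# the honest single-goal search, and that forging a friendly-looking-only
# trace for the resulting solution costs at least one honest search's worth
# of work on top of that.

def honest_cost(base: float) -> float:
    return base                         # search for G(friendly-looking) only

def deceptive_cost(base: float, joint_blowup: float = 10.0) -> float:
    joint_search = base * joint_blowup  # satisfy G(friendly-looking) & G(evil)
    forged_trace = base                 # fabricate the friendly-only trace
    return joint_search + forged_trace

base = 1.0
print(deceptive_cost(base) / honest_cost(base))  # 11.0: overhead the meter sees
```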
If the AI used an efficient method for its scheming, then calculated a semi-efficient “legit” computation, and then faked a really energy-inefficient log, it could fit the extra baggage within the energy budget.
It could do a bunch of stuff that could be dangerous. It could plan to be dangerously neglectful. Identifying and objecting to that whole spectrum of misbehaviour isn’t a trivial matter. I think a big danger is a human reading a legitimate log of harmful activity and approving of it: “I guess murdering babies is a legit step in this friendly calculation”, when judging whether a particular step is warranted or not requires expert opinion. In a way you don’t just want it not to raise your suspicions; you want it to actually try to be friendly. For example, if it wants you to build a certain molecule, that molecule could be either a miraculous nanobot or a highly virulent virus. If there is no information about why it wants it, judging the molecule might fall outside the expertise of the human. However, its reasoning for wanting that kind of molecule could be a generic goodness principle that sounds fine on paper (“Nanobots improve health, therefore I want to make a nanobot”), where the failure mode is that its conception of health is too indexical: healthy for it, so that it can propagate its likeness into a biological computing substrate. Would the illegitimate-goal detector catch on to that too-indexical conception of health?
If the AI knows that it is being analysed, then in practice the cognitive decisions it makes are a form of output, since the analysis is causally dependent on those cognitive choices.
It can’t fake a log without actually doing work >= the purported work of the log.
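Putting the objection and this reply side by side as a toy calculation (all numbers and unit costs invented for illustration):

```python
# The attack: cheap scheming + a semi-efficient "legit" computation, padded
# by a deliberately inefficient log so the total looks like it fits.
# The reply: a log can't be faked without actually doing at least the work
# it purports to describe, so the scheming is overhead the meter still sees.

def attacker_hopes_is_checked(purported_log_work: float, budget: float) -> bool:
    return purported_log_work <= budget

def actually_checked(scheme_work: float, legit_work: float,
                     purported_log_work: float, budget: float) -> bool:
    # Producing the fake log costs >= its purported work, which already
    # covers the legit computation it reports.
    assert purported_log_work >= legit_work
    real_work = scheme_work + purported_log_work
    return real_work <= budget

print(attacker_hopes_is_checked(purported_log_work=30, budget=30))   # True
print(actually_checked(scheme_work=5, legit_work=20,
                       purported_log_work=30, budget=30))            # False
```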
What I’m apparently failing to get across to you is that meta-level thinking will also be reflected in the output of the deception detector. The deception detector, by design, always operates one level higher than the program being audited.