First, no, the AGI is not going to “employ complex heuristics to ever-better approximate optimal hypotheses update”. The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability, just as we can prove that, e.g., SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).
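To make “converges to the correct hypothesis with high probability (assuming realizability)” concrete, here is a minimal sketch in the simplest Bayesian setting. This is my own toy illustration, not part of the proposal: a finite hypothesis class of biased coins, one of which (realizability) actually generates the data, and a posterior that concentrates on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hypothesis class: nine biased-coin hypotheses, P(heads) in {0.1,...,0.9}.
biases = np.linspace(0.1, 0.9, 9)
true_idx = 3                                        # realizability: the true
                                                    # environment is in-class
posterior = np.full(len(biases), 1 / len(biases))   # uniform prior

for t in range(500):
    obs = rng.random() < biases[true_idx]           # sample from the true coin
    likelihood = biases if obs else 1 - biases      # P(obs | each hypothesis)
    posterior *= likelihood
    posterior /= posterior.sum()                    # Bayes update

# With high probability over the sampled data, the posterior concentrates
# on the correct hypothesis.
print(biases[np.argmax(posterior)], posterior.max())
```

The guarantee being gestured at is of this shape, only for much richer hypothesis classes and agents: convergence is a provable (or at least precisely conjecturable) property of the update rule, not an emergent behavior of opaque heuristics.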
Second, there’s the issue of non-cartesian attacks (“hacking the computer”). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g., the vast majority of programs won’t hack the computer) such that we still have approximate learnability (i.e., the system is not too afraid to run computations) without misspecification (i.e., the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.
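To illustrate the sweet spot, here is a deliberately crude expected-utility sketch. It is my own toy model (the `should_run` helper and all numbers are hypothetical, not from the proposal): a prior that assigns tiny probability to a computation hacking the computer leaves running computations net-positive, while a prior that is not dogmatic still flips the decision when the estimated risk is high.

```python
# Toy model: decide whether to run a program given a prior probability
# p_hack that running it has the harmful side-effect of hacking the computer.

def should_run(p_hack: float, benefit: float, harm: float) -> bool:
    """Run the computation iff its expected value is positive."""
    return (1 - p_hack) * benefit - p_hack * harm > 0

# A well-shaped prior: the vast majority of programs are safe, so the
# system is not too afraid to run computations (approximate learnability).
print(should_run(p_hack=1e-6, benefit=1.0, harm=1e3))   # True

# But the prior is not overconfident: a high enough risk estimate
# makes the system refuse (no misspecification).
print(should_run(p_hack=0.5, benefit=1.0, harm=1e3))    # False

# Hardening the system lowers the achievable harm, widening the range
# of priors for which running computations stays net-positive.
print(should_run(p_hack=0.5, benefit=1.0, harm=0.1))    # True
```

In this caricature, “hardening” widens the sweet spot by shrinking `harm`, which is the intuition behind the last sentence above.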
Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What exactly the assumptions are going to be, and what needs to be done to make sure they hold, is work for the future, of course.
The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability.
I understand now; that was the main misunderstanding motivating my worries. This and your other two points have driven home for me the role mathematical guarantees play in the protocol, which I wasn’t contemplating. Thanks again for your kind answers!