Simon Fischer comments on How to safely use an optimizer

Simon Fischer 28 Mar 2024 21:57 UTC
7 points
2
First thought: The oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world^[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.
Agreed. I think it’s potentially a good bit worse than one kilobyte if we let ourselves get tricked into asking many questions, asking different questions, or lowering the difficulty of the safety constraint too much.
As mentioned in footnote 10, this requires a kind of perfect coordination between different Oracles that we manage to build, where the Oracles have to be able to correctly predict for all kinds of questions whether humanity (with the help of other Oracles) has been able to figure out the answers (or the rough sizes of various sets). Otherwise, when caught in deception, we can (as you noted) scrap or retrain the Oracles.
Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they’ve been satisficed? Are they the kind of problems where solving them can save the world?
You’re correct in that this approach is only useful insofar as validation is easier than generation. For this technique, though, the validation doesn’t have to be done by a formal proof checker; any program that you can run suffices. It might even be a very slow program (e.g. a big LLM) if you have an effective way to communicate your goal set to the Oracle (e.g. using a natural language prompt, as we already do with current AIs).
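To make the "any program that you can run" point concrete, here is a minimal sketch, not the scheme from the post itself. It assumes a toy goal set (nontrivial factorisations of an integer) and uses hypothetical names (`is_nontrivial_factorisation`, `toy_oracle`, `ask`) that I made up for illustration. The validator is a cheap multiplication check, while the stand-in "Oracle" has to search; the wrapper only releases a candidate that the validator accepts.

```python
from typing import Callable, Optional, Tuple

Candidate = Tuple[int, int]


def is_nontrivial_factorisation(n: int, candidate: Candidate) -> bool:
    """Validator (hypothetical): membership test for the goal set.

    Cheap to run: one multiplication and a few comparisons.
    """
    p, q = candidate
    return 1 < p < n and 1 < q < n and p * q == n


def toy_oracle(n: int) -> Optional[Candidate]:
    """Stand-in for the untrusted Oracle (assumption: trial division).

    Generation is the expensive side; it may also decline to answer.
    """
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return (p, n // p)
    return None


def ask(
    oracle: Callable[[int], Optional[Candidate]],
    validate: Callable[[int, Candidate], bool],
    n: int,
) -> Optional[Candidate]:
    """Release the Oracle's candidate only if our own validator accepts it."""
    candidate = oracle(n)
    if candidate is not None and validate(n, candidate):
        return candidate
    return None


if __name__ == "__main__":
    print(ask(toy_oracle, is_nontrivial_factorisation, 91))
    # An Oracle that returns something outside the goal set gets filtered out.
    print(ask(lambda n: (3, 5), is_nontrivial_factorisation, 91))
```

Nothing here depends on the validator being a proof checker; `validate` could be swapped for any slower program, as long as we trust it more than the Oracle. Note that whether `ask` returns a candidate or `None` is itself the answer/no-answer bit discussed above, so this wrapper does nothing to close that channel.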