We would still have to address the worry mentioned earlier about “formally specifying complex requirements such as ‘don’t drive humanity extinct’.” I haven’t a clue. Anyone have ideas?
Sure. To start with an easier example:
Where I get confused is: what would it mean to prove that a given piece of code will do even straightforward tasks like ensuring cybersecurity? How does one prove that you cannot gain access without proper credentials? Doesn’t that rely on physical properties, and couldn’t there be a bug or some physical manipulation one could make?
No idea if the below is what Tegmark and Omohundro meant, but here’s how I think it can theoretically work:
Consider frameworks like Bayesian probability theory or various decision theories, which (strive to) establish the formally correct algorithms for how systems embedded in a universe larger than themselves must act, even under various uncertainties: how to update on observations, what decisions to make given what information, etc. They still take on a “first-person” perspective: they assume that you’re operating on models of reality rather than on reality directly — but they strive to be formally correct given this setup.
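For concreteness on the “formally correct given this setup” part: Bayes’ theorem pins down exactly how a credence should shift when an observation comes in, even though the prior and likelihoods it operates on live in your model of reality rather than in reality itself. A toy sketch, with made-up numbers:

```python
# Toy Bayesian update: how a credence in a hypothesis H should shift
# after seeing evidence E. The numbers are illustrative, not from anywhere.

prior_h = 0.3          # P(H): prior credence in the hypothesis
p_e_given_h = 0.8      # P(E | H): how likely the evidence is if H is true
p_e_given_not_h = 0.1  # P(E | ~H): how likely the evidence is if H is false

# Law of total probability: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)

# Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)
posterior_h = p_e_given_h * prior_h / p_e

print(f"P(H) = {prior_h}, P(H | E) = {posterior_h:.3f}")  # ~0.774
```

The numbers are arbitrary, but given them, the rule leaves no freedom: the posterior is whatever the theorem says it is. That’s the sense of “formally correct” I’m gesturing at.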
Plausibly, this could be extended to agents and models of reality in general. There may be a correct algorithm for being a goal-pursuing agent, and a correct way to build world-models. (See the Natural Abstractions research direction or my case for it.)
So, hypothetically, hopefully, we could design an algorithm that builds a world-model in a formally correct manner (or, at least, does that approximately — with “approximately” likewise defined in a strict mathematical manner), and then pursues pre-specified goals in it in a strictly correct fashion.
We then find a concept in this “correct” world-model that corresponds to “cybersecurity”, define a goal like “output formally secure code meeting the following specifications (likewise defined in terms of derived concepts)”, and hand that goal off to our “correct” agency algorithm.
Our agent will then do its best to ensure the code it produces is secure even under real physical conditions, etc. And it won’t be mere speculation or babbling; it won’t simply “do its best” — it will do as well as it theoretically could have.
In the end, we’d have a system that uses (the approximation of) a formally correct world-model to pursue (the approximation of) a formally defined goal in a(n approximately) formally correct manner, with every step of the process resting on hard mathematical proofs.
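Nothing remotely like this exists, of course, and building each piece is the open problem. But just to make the shape of the pipeline above concrete, here’s a deliberately tiny toy where every name and detail is made up: the “world-model” is just the set of states consistent with the observations, the “concept” is a hand-written predicate, and the “agent” brute-forces over two actions.

```python
# Purely illustrative toy version of the pipeline described above.
# Every name and detail here is made up; the real versions of each
# stand-in are exactly the unsolved problems.

from itertools import product

# "Reality" in this toy: a state is 3 bits; the agent has observed that bit 0 is 1.
all_states = list(product([0, 1], repeat=3))
observations = [lambda s: s[0] == 1]

# Step 1: stand-in for the hypothetical "formally correct world-model builder".
# Here it's just the set of states consistent with every observation.
world_model = [s for s in all_states if all(obs(s) for obs in observations)]

# Step 2: stand-in for "locating the concept": a hand-written predicate
# playing the role of "this outcome is secure" (here: bit 2 is 0).
def concept_secure(state):
    return state[2] == 0

# Step 3: stand-in for the "formally correct agency algorithm": keep only
# actions whose outcomes satisfy the concept in every state the world-model
# still considers possible, i.e. robustly under the remaining uncertainty.
actions = {
    "set_bit_2_to_0": lambda s: (s[0], s[1], 0),
    "do_nothing": lambda s: s,
}

def guaranteed_secure(action):
    return all(concept_secure(action(s)) for s in world_model)

print([name for name, act in actions.items() if guaranteed_secure(act)])
# -> ['set_bit_2_to_0']
```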
I actually do think this is more doable than it sounds, but, uh… It’s definitely not “an unsolved yet easier” challenge compared to robust AGI alignment. Nay, it is in fact an AGI-alignment-complete problem.
Consider frameworks like Bayesian probability theory or various decision theories, which (strive to) establish the formally correct algorithms for how systems embedded in a universe larger than themselves must act, even under various uncertainties: how to update on observations, what decisions to make given what information, etc. They still take on a “first-person” perspective: they assume that you’re operating on models of reality rather than on reality directly — but they strive to be formally correct given this setup.
Admittedly, this does have one big problem, which I’ll spell out below:
Formalization is a pain in the butt, and we have good reasons to believe that formalizing things will be so hard as to be essentially impossible in practice, except in very restricted circumstances. In particular, this is one way in which rationalism and Bayesian reasoning fail to scale down: They assume either infinite computation or, in the regime of bounded Bayesian reasoning/rationality, the ability to solve very difficult problems (NP-complete, co-NP-complete, #P-complete, PSPACE-complete, or worse). This is generally why formal frameworks don’t work out very well in the real world: absent oddball assumptions about physics, we probably won’t be able to solve things formally in a lot of cases, ever.
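To make the scaling point concrete, here’s a quick back-of-the-envelope sketch. It only shows the brute-force flavor of the problem; roughly speaking, the hardness classes above are the formal version of the claim that (barring surprising complexity-theoretic collapses) you can’t always dodge this worst case by being clever.

```python
# Toy illustration of the scaling wall: exact inference over n binary variables
# can require summing over all 2**n joint states. Assuming a (generous) 10^9
# state-evaluations per second, brute-force enumeration time looks like this:

for n in (10, 30, 50, 100, 300):
    states = 2 ** n
    seconds = states / 1e9
    print(f"n = {n:>3}: 2^n = {states:.3e} states, ~{seconds:.3e} s to enumerate")
```

Even at n = 100, which is tiny compared to any realistic world-model, that’s on the order of 10^21 seconds, many orders of magnitude longer than the age of the universe.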
So the question is, why do you believe formalization is tractable at all for the AI safety problem?
So the question is, why do you believe formalization is tractable at all for the AI safety problem?
It’s more that I don’t positively believe it’s not tractable. Some of my reasoning is outlined here, some of it is based on inferences and models that I’m going to distill and post publicly aaaany day now, and mostly it’s an inside-view feel for what problems remain and how hopeless-to-solve they feel.
Which is to say, I can absolutely see how a better AGI paradigm may be locked behind theoretical challenges on the difficulty level of “prove that P≠NP”, and I certainly wouldn’t bet civilization on solving them in the next five or seven years. But I think it’s worth keeping an eye out for whether, e.g., some advanced interpretability tool we invent turns out to have a dual use as a foundation of such a paradigm, or puts it within reach.
They assume either infinite computation or, in the regime of bounded Bayesian reasoning/rationality, the ability to solve very difficult problems
Yeah, this is why I added “approximation of” to every “formal” in the summary in my original comment. I have some thoughts on looping computational complexity into agency theory, but that may not even be necessary.