(Posting as a top-level comment, but this is mainly a response to @the gears to ascension’s request for perspectives here.)
I like this:
One core aspect of our theory of change is backchaining: come up with an at least remotely plausible story for how the world is saved from AI doom, and try to think about how to get there.
as a general strategy. In terms of Orthogonal’s overall approach and QACI specifically, one thing I’d like to see more of is how it can be applied to relatively easier (or at least, plausibly easier) subproblems like the symbol grounding problem, corrigibility, and diamond maximization, separately and independently from using it to solve alignment in general.
I can’t find the original source, but I think someone (Nate Soares, maybe somewhere in the 2021 MIRI conversations?) once said that robust alignment strategies should scale and degrade gracefully: they shouldn’t depend on solving only the hardest problem, and avoiding catastrophic failure shouldn’t depend on superintelligent capability levels. (I might be mis-remembering or imagining this wholesale, but I agree with the basic idea as I’ve stated it here. Another way of putting it: ideally, you want some “capabilities” parameter in the system that you can dial up gradually, and you turn that dial just enough to solve the weakest problem that ends the acute risk period. Maybe afterwards you use the same system, dialed up even further, to bring about the GTF, but regardless, you should be able to do easy things before you do hard things.)
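To make the dial idea concrete, here is a minimal toy sketch; the `attempt(task, capability)` oracle and all other names are purely hypothetical stand-ins I’m inventing for illustration, not anything from QACI or Orthogonal:

```python
def weakest_sufficient_capability(tasks_easiest_first, attempt, max_capability=100):
    """Dial a capability parameter up gradually; stop at the first level that
    solves some task sufficient to end the acute risk period."""
    for capability in range(1, max_capability + 1):
        for task in tasks_easiest_first:          # try easy things before hard things
            solution = attempt(task, capability)  # hypothetical oracle call
            if solution is not None:
                return capability, task, solution  # stop dialing here
    return None  # no capability level in range was enough
```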
I’m not sure that QACI fails to satisfy these desiderata, but I’m not sure that it satisfies them either.
In any case, very much looking forward to more from Orthogonal!
Recently we modified QACI to give a score over actions instead of over worlds. This should allow weaker systems inner-aligned to QACI to output weaker, non-DSA actions, such as the textbook from the future, or just human-readable advice on how to end the acute risk period. Stronger systems might output instructions for how to go about solving corrigibility, or something to this effect.
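Roughly, the shape of that change looks like the following toy sketch; the type stand-ins and function names are illustrative assumptions only, not the actual QACI formalism:

```python
from typing import Callable, Sequence

World = str    # stand-in for a full world-model
Action = str   # stand-in for a candidate output (e.g. a piece of advice)

# Old shape: utility assigned to whole worlds; the system must then plan
# whatever actions steer toward high-scoring worlds.
ScoreWorlds = Callable[[World], float]

# New shape: the score is taken directly over the system's own outputs,
# so a weaker system can simply emit its best-scoring, non-DSA action.
ScoreActions = Callable[[Action], float]

def choose_action(candidates: Sequence[Action], score: ScoreActions) -> Action:
    """Return the highest-scoring candidate action."""
    return max(candidates, key=score)
```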
As for diamonds, we believe this is actually a harder problem than alignment, and it’s a mistake to aim at it. Solving diamond-maximization requires us to point at what we mean by “maximizing diamonds” in physics in a way which is ontologically robust. QACI instead gives us an easier target: informational data blobs which causally relate to a human. The cost is that we now hand power to that human user to implement their values, but this is no issue since that’s what we wanted to do anyway. If the humans in the QACI interval were actually pursuing diamond-maximization instead of some form of human values, QACI would solve diamond maximization.
Also, regarding ontologies and having a formal goal which is ontology-independent (?): I’m curious about Orthogonal’s take on e.g. Finding gliders in the game of life, in terms of the role you see for this kind of research, whether the conclusions and ideas in that post specifically are on the right track, and how they relate to QACI.
I interpret the post you linked as trying to solve the problem of pointing to things in the real world. Being able to point to things in the real world in a way which is ontologically robust is probably necessary for alignment. However, “gliders”, “strawberries”, and “diamonds” seem like incredibly complicated objects to point to in a way which is ontologically robust, and it is not clear that being able to point to these objects actually leads to any kind of solution.
What we are interested in is research into how to create a sufficiently statistically unique piece of data and how to reliably point to it. Pointing to pure information seems like it would be more physics-independent and run into fewer issues with ontological breakdowns.
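As a rough illustration of what “statistically unique piece of data” means here (toy code only, not our actual construction; the names are made up for the example):

```python
import secrets

BLOB_BITS = 4096  # enough entropy that an accidental duplicate is ~impossible

def make_unique_blob() -> bytes:
    """Sample a high-entropy blob; any fixed guess matches it with probability 2**-4096."""
    return secrets.token_bytes(BLOB_BITS // 8)

def points_to(candidate: bytes, blob: bytes) -> bool:
    """In this toy version, the 'pointer' is just exact equality with the sampled bytes."""
    return candidate == blob
```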
The QACI scheme then lets us construct more complicated formal objects, using counterfactuals on these pieces of data, out of which we can build a long reflection process.
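For instance, the chaining of counterfactual query/answer steps into a long reflection might be caricatured like this; `consider` stands in for the counterfactual human interval, and everything here is an illustrative assumption rather than QACI’s actual construction:

```python
from typing import Callable, Tuple

def long_reflection(initial_query: bytes,
                    consider: Callable[[bytes], Tuple[bytes, bool]],
                    max_steps: int = 1_000_000) -> bytes:
    """Chain counterfactual query/answer steps until the user signals 'done'."""
    query = initial_query
    for _ in range(max_steps):
        answer, done = consider(query)  # counterfactual: the blob is replaced by this query
        if done:
            return answer               # final output of the reflection
        query = answer                  # feed the answer back in as the next query
    return query
```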