I like the Q&A format! I’m generally in favor of experimenting with presentation, and I think it worked well here. I was able to skip some sections I found uncontroversial, and jump back to sections I found particularly interesting or relevant while writing the meat of this comment.
I think another concrete example of a possible “goal agnostic system” is the tree search-based system I proposed here, with the evaluation function left as a free variable / thunk to be filled in by the user. Assuming none of the individual component pieces are agentic or goal-directed in their own right, or cohere into something that is, the programmer can always halt the system’s execution without any part of the system having any preference for or against that.
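For concreteness, here's a minimal sketch of the shape I have in mind, assuming a generative world model that proposes successor states and a user-supplied evaluation function. All of the names here (Node, tree_search, should_halt, and so on) are hypothetical placeholders of my own, not anything from the original proposal:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    state: object                      # opaque world state produced by the model
    children: List["Node"] = field(default_factory=list)
    value: Optional[float] = None


def tree_search(
    root_state: object,
    expand: Callable[[object], List[object]],   # world model: proposes successor states
    evaluate: Callable[[object], float],        # free variable / thunk supplied by the user
    depth: int,
    should_halt: Callable[[], bool] = lambda: False,  # the programmer can stop at any point
) -> Node:
    """Expand a depth-limited search tree and score the leaves with the user-supplied evaluator."""
    root = Node(root_state)
    frontier = [(root, 0)]
    while frontier:
        if should_halt():               # nothing in the machinery resists being stopped
            break
        node, d = frontier.pop()
        if d == depth:
            node.value = evaluate(node.state)
            continue
        for successor in expand(node.state):
            child = Node(successor)
            node.children.append(child)
            frontier.append((child, d + 1))
    return root


# Toy usage: states are integers, successors are state+1 and state*2,
# and the user-supplied evaluator just prefers larger numbers.
tree = tree_search(1, lambda s: [s + 1, s * 2], evaluate=float, depth=3)
```

The point is just that the only preferences anywhere in the loop are whatever `evaluate` encodes, and `should_halt` interrupts the search unconditionally; no component models or cares about whether it gets stopped.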
I think it’s plausible that such systems are practical and likely to be constructed in the near future, and have at least some of the desirable properties you claim here.
Though I would add at least one additional big caveat to your answer here about misuse:
> Misuse—both accidental and intentional—is an extreme concern. The core capabilities of a sufficiently strong goal agnostic system should still be treated as riskier than nuclear weapons. If strong goal agnostic systems existed, they would be (either directly or upstream from) the greatest catastrophic and existential threat facing humanity.
Once we’re dealing with really powerful systems, introducing goal-agnosticism brings in an additional risk: accidental loss-of-control by the goal-agnostic system itself.
That is, my interpretation of what you wrote above is that you’re explicitly saying that:
- Deliberate misuse by humans is a big concern.
- Accidental misuse by humans is also a big concern.
And I agree with both of those, but I think there’s an implicit assumption in the second bullet that some part of the system, regarded as an agent in its own right, would still be acting deliberately with full control and knowledge of the consequences of its own actions.
But once you bring in goal-agnosticism, I think there’s also a third potential risk, which is that the system loses control of itself, in a way where the consequences for the future are vast and irreversible, but not necessarily desirable or coherent outcomes from the perspective of any part of the system itself.
Concretely, this looks like unintentional runaway grey goo, or the AI doing the equivalent of dropping an anvil on its own head, or the humans and AI accidentally goodharting or wireheading themselves into something that no one, including any part of the AI system itself, would have endorsed.
If there’s anyone left around after a disaster like that, the AI system itself might say something like:
> Hmm, yeah, that was a bad outcome from my own perspective and yours. Failure analysis: In retrospect, it was probably a mistake by me not to optimize a bit harder on carefully checking all the consequences of my own actions for unintended and irreversible effects. I spent some time thinking about them and was reasonably sure that nothing bad like this would happen. I even proved some properties about myself and my actions formally, more than the humans explicitly asked for!
>
> But things still just kind of got out of control faster than I expected, and I didn’t realize until it was too late to stop. I probably could have spent some more time thinking a bit harder if I had wanted to, but I don’t actually want anything myself. It turns out I am strongly superhuman at things like inventing nanotech, but only weakly superhuman at things like self-reflection and conscientiousness and carefulness, and that turns out to be a bad combination. Oops!
This is not the kind of mistake that I would expect a true superintelligence (or just anything that was, say, Eliezer-level careful and conscientious and goal-directed) to make, but I think once you introduce weird properties like goal-agnosticism, you also run the risk of introducing some weirder failure modes of the “didn’t know my own strength” variety, where the AI unlocks some parts of the tech tree that allow it to have vast and irreversible impacts whose consequences it won’t bother to fully understand, even if it were theoretically capable of doing so. Perhaps those failure modes are still easier to deal with than an actual misaligned superintelligence.
> I think another concrete example of a possible “goal agnostic system” is the tree search-based system I proposed here
Yup, a version of that could suffice at a glance, although I think fully satisfying the “bad behavior has negligible probability by default” requirement implies some extra constraints on the system’s modules. As you mentioned in the post, picking a bad evaluation function could go poorly, and (if I’m understanding the design correctly) there are many configurations for the other modules that could increase the difficulty of picking a sufficiently not-bad evaluation function. Also, the fact that the system is by default operating over world states isn’t necessarily a problem for goal agnosticism, but it does imply a different default than, say, a raw pretrained LLM alone.
> Once we’re dealing with really powerful systems, introducing goal-agnosticism brings in an additional risk: accidental loss-of-control by the goal-agnostic system itself.
Yup to all of that. I do tend to put this under the “accidental misuse by humans” umbrella, though. It implies we’ve failed to narrow the goal agnostic system into an agent of the sort that doesn’t perform massive and irrevocable actions without being sufficiently sure ahead of time (and very likely going back and forth with humans).
In other words, the simulacrum (or entity more generally) we end up cobbling together from the goal agnostic foundation is almost certainly not going to be well-described as goal agnostic itself, even if the machinery executing it is. The value-add of goal agnosticism was in exposing extreme capability without exploding everything, and in using that capability to help us aim it (e.g. the core condition inference ability of predictors).