HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or “write down a perfectly aligned AGI from scratch”. The spirit wasn’t ‘we should dutifully work on these problems because they’re Important-sounding and Philosophical’; from my perspective, it was more like ‘we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets’.
I think the issue is that my mental model of the process you describe summarizes it as “you need to solve a lot of philosophical issues for it to work”, and so that’s what I get by default when I query for that agenda. Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain to me why that’s inaccurate?
From my perspective, the biggest reason MIRI started diversifying approaches away from our traditional focus was shortening timelines, where we still felt that “conceptual” progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to ‘there may not be enough time to finish the core AF stuff’, enough to want to put a lot of time into other problems too.
Yeah, I think this is a pretty common perspective on that work from outside MIRI. That’s my take (that there isn’t enough time to solve all of the necessary components), and it’s the one I’ve seen people use when discussing MIRI multiple times.
Actually, I’m not sure how to categorize MIRI’s work using your conceptual vs. applied division. I’d normally assume “conceptual”, because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about “experimentally testing these ideas [from conceptual alignment]”, which sounds like the 2017-initiated lines of research we described in our 2018 update. If someone is running software experiments to test ideas about “Seeking entirely new low-level foundations for optimization” outside the current ML paradigm, where does that fall?
A really important point is that the division isn’t meant to split researchers themselves but research. So the experimental part would be applied alignment research and the rest conceptual alignment research. What’s interesting is that this is a good example of applied alignment research that doesn’t have the benefits I mention for more prosaic applied alignment research: being publishable at big ML/AI conferences, being within an accepted paradigm of modern AI...
Prosaic AGI alignment and “write down a perfectly aligned AGI from scratch” both seem super doomed to me, compared to approaches that are neither prosaic nor perfectly-neat-and-tidy. Where does research like that fall?
I would say that the non-prosaic approaches require at least some conceptual alignment research (because the research can’t be done fully inside current paradigms of ML and AI), but probably encompass some applied research. Maybe Steve’s work is a good example, with a proposed split of two of his posts in this comment.
Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain to me why that’s inaccurate?
I don’t know what you mean by “perfectly rational AGI”. (Perfect rationality isn’t achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?)
I think of the basic case for HRAD this way:
We seem to be pretty confused about a lot of aspects of optimization, reasoning, decision-making, etc. (Embedded Agency is talking about more or less the same set of questions as HRAD, just with subsystem alignment added to the mix.)
If we were less confused, it might be easier to steer toward approaches to AGI that make it easier to do alignment work like ‘understand what cognitive work the system is doing internally’, ‘ensure that none of the system’s compute is being used to solve problems we don’t understand / didn’t intend’, ‘ensure that the amount of quality-adjusted thinking the system is putting into the task at hand is staying within some bound’, etc.
These approaches won’t look like decision theory, but being confused about basic ground-floor things like decision theory is a sign that you’re likely not in an epistemic position to efficiently find such approaches, much like being confused about how/whether chess is computable is a sign that you’re not in a position to efficiently steer toward good chess AI designs.
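As a concrete aside on the chess analogy (an illustration added here, not part of the original exchange): the underlying fact is that any finite, perfect-information, zero-sum game has a computable value via backward induction (minimax), and chess is such a game under the standard draw rules that make every game finite. The sketch below uses a hypothetical toy take-1-or-2-stones game as a stand-in, since full chess is too large to enumerate directly.

```python
# Minimal backward-induction (minimax) sketch over a finite game, as a
# stand-in for "chess is computable". Toy rules (illustrative only): players
# alternately take 1 or 2 stones from a pile; whoever takes the last stone wins.
from functools import lru_cache

@lru_cache(maxsize=None)
def game_value(pile: int, to_move: int) -> int:
    """Value of the position for the maximizing player (+1 win, -1 loss).

    `to_move` is +1 when the maximizing player moves, -1 otherwise.
    The game tree is finite, so the recursion always terminates.
    """
    if pile == 0:
        # The previous player took the last stone and wins.
        return -to_move
    successors = [game_value(pile - k, -to_move) for k in (1, 2) if k <= pile]
    return max(successors) if to_move == 1 else min(successors)

print(game_value(4, 1))  # 1: a pile of 4 is a win for the player to move
print(game_value(6, 1))  # -1: a multiple of 3 is a loss for the player to move
```

The same argument (finite tree, well-defined terminal values) is what makes chess’s game value well-defined and computable in principle, even though its tree is astronomically large; that kind of ground-floor clarity is what the analogy is pointing at.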
OK, thanks for the clarifications!