So I’ve been thinking about this for a while, and I think I disagree with what I understand of your perspective. Which might, of course, just mean I misunderstand your perspective.
What I think I understand is that you judge safety research directions based on how well they could work on an evolutionary process like the one that created humans. But for me, the most promising approach to AGI is based on local search, which differs a bit from an evolutionary process. I don’t really see a reason to consider evolutionary processes instead of local search, and even then, the evolution of humans in particular is probably far too specific to serve as a test bench.
This matters because problems for one are not necessarily problems for the other. For example, one way to mess with an evolutionary process is to find a way for everything to survive and reproduce/disseminate. Technology in general did that for humans, which means the evolutionary pressure decreased as technology evolved. But that’s not a problem for local search, since at each step there will be only one next program.
On the other hand, local search might be dangerous because of things like gradient hacking, and those don’t make sense for evolutionary processes.
In conclusion, I feel for the moment that backchaining to local search is a better heuristic for judging safety research directions. But I’m curious about where our disagreement lies on this issue.
One source of our disagreement: I would describe evolution as a type of local search. The difference is that it’s local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don’t think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead.
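To make the "evolution as local search" framing concrete, here is a minimal sketch, not anything taken from either of our actual setups. It assumes a toy one-dimensional objective (`fitness`, peaking at 3) and two hypothetical helpers: `hill_climb`, which keeps a single candidate and so has exactly one "next program" per step, and `evolve`, which applies the same small-perturbation idea to a whole population. All names and parameters are illustrative assumptions.

```python
import random


def fitness(x: float) -> float:
    """Toy objective with a single peak at x = 3 (an illustrative assumption)."""
    return -(x - 3.0) ** 2


def hill_climb(x: float, steps: int, rng: random.Random, step_size: float = 0.1) -> float:
    """Local search over one candidate: at each step there is only one next candidate."""
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, step_size)
        # Accept the perturbed candidate only if it strictly improves fitness.
        if fitness(candidate) > fitness(x):
            x = candidate
    return x


def evolve(population: list[float], generations: int, rng: random.Random,
           step_size: float = 0.1) -> list[float]:
    """Local search over a population: each generation is a small change to the last."""
    for _ in range(generations):
        # Selection: the fitter half of the population survives unchanged.
        survivors = sorted(population, key=fitness, reverse=True)[: len(population) // 2]
        # Reproduction: each survivor leaves one slightly mutated offspring.
        offspring = [parent + rng.gauss(0.0, step_size) for parent in survivors]
        population = survivors + offspring
    return population


rng = random.Random(0)
single = hill_climb(0.0, steps=2000, rng=rng)
final_population = evolve([0.0] * 20, generations=200, rng=rng)
best = max(final_population, key=fitness)
print(f"hill climbing ended at {single:.3f}, best evolved individual at {best:.3f}")
```

In this sketch both procedures only ever move by small perturbations of their current state. The difference is what the "state" is: a single candidate in one case, the whole population in the other. The sketch deliberately leaves out anything specific to gradient descent or to the human ancestral environment.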
In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. The thing it’s missing, though, is that it doesn’t tell you which approaches will actually scale up to training regimes which are incredibly complicated, applied to fairly intelligent agents. For example, impact penalties make sense in a local search context for simple problems. But to evaluate whether they’ll work for AGIs, you need to apply them to massively complex environments. So my intuition is that, because I don’t know how to apply them to the human ancestral environment, we also won’t know how to apply them to our AGIs’ training environments.
Similarly, when I think about MIRI’s work on decision theory, I really have very little idea how to evaluate it in the context of modern machine learning. Are decision theories the type of thing which AIs can learn via local search? Seems hard to tell, since our AIs are so far from general intelligence. But I can reason much more easily about the types of decision theories that humans have, and the selective pressures that gave rise to them.
As a third example, my heuristic endorses Debate due to a high-level intuition about how human reasoning works, in addition to a low-level intuition about how it can arise via local search.
So if I try to summarize your position, it’s something like: backchain to local search for simple and single-AI cases, and then think about aligning humans for the scaled and multi-agent version? That makes much more sense, thanks!
I also definitely see why your full heuristic doesn’t feel immediately useful to me: because I mostly focus on the simple and single-AI case. But I’ve been thinking more and more (in part thanks to your writing) that I should allocate more thinking time to the more general case. I hope your heuristic will help me there.
Cool, glad to hear it. I’d clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they’ll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I’m not sure.)