These are interesting anecdotes but it feels like they could just as easily be used to argue for the opposite conclusion.
That is, your frame here is something like “planning is hard therefore you should distrust alignment plans”.
But you could just as easily frame this as “abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments”.
Also, the second section makes an argument in favor of backchaining. But that seems to contradict the first section, in which people tried to backchain and it went badly. The best way for them to make progress would have been to play around with a bunch of possibilities, which is closer to forward-chaining.
(And then you might say: well, in this context, they only got one shot at the problem. To which I’d say: okay, so the most important intervention is to try to have more shots on the problem. Which isn’t clearly either forward-chaining or back-chaining, but probably closer to the former.)
These are interesting anecdotes but it feels like they could just as easily be used to argue for the opposite conclusion.
That is, your frame here is something like “planning is hard therefore you should distrust alignment plans”.
But you could just as easily frame this as “abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments”.
That doesn’t sound right to me.
The reported observation is not just that these particular people failed at a planning / reasoning task. The reported observation is that they repeatedly made optimistic, miscalibrated assumptions, because those assumptions supported a plan.
There’s a more specific reasoning error that’s being posited, beyond “people are often wrong when trying to reason about abstract domains without feedback”. Something like “people will anchor on ideas, if those ideas are necessary for the success of a plan, and they don’t see an alternative plan.”
If that posit is correct, that’s not just an update of “reasoning abstractly is hard and we should widen our confidence intervals / be more uncertain”. We should update to having a much higher evidential bar for the efficacy of plans.
I had a second half of this essay that felt like it was taking too long to pull together and I wasn’t quite sure who I was arguing with. I decided I’d probably try to make it a second post. I generally agree it’s not that obvious what lessons to take.
The beginning of the second-half/next-post was something like:
There’s an age-old debate about AI existential safety, which I might summarize as the viewpoints:
1. “We only get one critical try, and most alignment research dodges the hard part of the problem, with wildly optimistic assumptions.”
vs
2. “It is basically impossible to make progress on remote, complex problems on your first try. So, we need to somehow factor the problem into something we can make empirical progress on.”
I started out mostly thinking through lens #1. I’ve updated that, actually, both views may be “hair on fire” levels of important. I have some frustrations both with some doomer-y people who seem resistant to incorporating lens #2, and with people who seem (in practice) to be satisfied with “well, iterative empiricism seems tractable, and we don’t super need to incorporate frame #1.”
I am interested in both:
trying to build “engineering feedback loops” that represent the final problem as accurately as we can, and then iterating on both “solving representative problems against our current best engineered benchmarks” and “continuing to build better benchmarks.” (Automating Auditing and Model Organisms of Misalignment seem like attempts at this.)
trying to develop training regimens that seem like they should help people plan better in Low-Feedback Domains, which includes theoretical work, empirical research that tries to keep its eye on the long-term ball, and the invention of benchmarks à la the previous bullet.
That is, your frame here is something like “planning is hard therefore you should distrust alignment plans”.
But you could just as easily frame this as “abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments”.
I think received wisdom in cryptography is “don’t roll your own crypto system”. I think this comes from a bunch of overconfident people doing this and then other people discovering major flaws in what they did, repeatedly.
The lesson is not “Reasoning about a crypto system you haven’t built yet is hard, and therefore it’s equally reasonable to say ‘a new system will work well’ and ‘a new system will work badly’.” Instead, it’s “Your new system will probably work badly.”
I think the underlying model is that there are lots of different ways for your new crypto system to be flawed, and you have to get all of them right, or else the optimizing intelligence of your rivals (ideally) or the malware industry (less ideally) will find the security hole and exploit it. If there are ten things you need to get right, and you have a 30% chance of screwing up each one, then the chance of complete success is 2.8%. Therefore, one could say that, if there’s a general fog of “It’s hard to reason about these things before building them, such that it’s hard to say in advance that the chance of failure in each thing is below 30%”, then that points asymmetrically towards overall failure.
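A minimal sketch of that conjunctive-failure arithmetic (the ten components and 30% per-component failure rate are the illustrative numbers from the paragraph above; treating the failures as independent is my simplification):

```python
# Illustrative sketch: a plan with n components that must all be gotten right,
# where each component independently has the same chance of being flawed.
def p_overall_success(n_components: int, p_fail_each: float) -> float:
    return (1.0 - p_fail_each) ** n_components

# The numbers from the comment above: ten things, 30% chance of screwing up each one.
print(f"{p_overall_success(10, 0.30):.1%}")  # ~2.8%
```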
I think Raemon’s model (and pretty certainly Eliezer’s) is indeed that an alignment plan is, in large part, like a security system, in that there are lots of potential weaknesses, any one of which could torpedo the whole system, and those weaknesses will be sought out by an optimizing intelligence. Perhaps your model is different?
Also, the second section makes an argument in favor of backchaining. But that seems to contradict the first section, in which people tried to backchain and it went badly.
This didn’t come across in the post, but I think people in the experiment were mostly doing things closer to (simulated) forward-chaining, then getting stuck, and then generating the questionable assumptions. (Which is also what I tended to do when I first started this experiment.)
An interesting thing I learned is that “look at the board and think without fiddling around” is actually a useful skill to have even when I’m doing the more open-ended “solve it however seems best.” It’s easier to notice now when I’m fiddling around pointlessly instead of actually doing useful cognitive work.
But you could just as easily frame this as “abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments”.
But doesn’t this argument hold with the opposite conclusion, too? E.g. “abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds”.
Yepp, all of these arguments can weigh in many different directions, depending on your background beliefs. That’s my point.
No, failure in the face of strong, active, hostile optimization pressure is different from failure when acting to overcome challenges in a neutral universe that just obeys physical laws. Building prisons is a different challenge from building bridges, especially when you know that there will be agents both inside and outside the prison that very much want to get through. Security is inherently a harder problem. A puzzle which has been designed to be challenging to solve, and to have a surprising, unintuitive solution, is thus a better match for designing a security system. I bet you’d have a much higher success rate of “first plans” if the game were one of those bridge-building games.
Yes to your first point. I think that
abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments
Is a fair characterization of those results. So would be the inverse, “abstract reasoning about unfamiliar domains is hard therefore you should distrust AI success arguments”.
I think both are very true, and so we should distrust both. We simply don’t know.
I think the conclusion taken,
planning is hard therefore you should distrust alignment plans
Is also valid and true.
People just aren’t as smart as we’d like to think we are, particularly when reasoning about complex and unfamiliar domains. So both our plans and our evaluations of them tend to be more untrustworthy than we’d like to think. Planning and reasoning require way more collective effort than we’d like to imagine. Careful studies of both individual reasoning in lab tasks and historical examples support this conclusion.
One major reason for this miscalibration is the motivated reasoning effect. We tend to believe what feels good (predicts local reward). Overestimating our reasoning abilities is one such belief, among very many examples of motivated reasoning.
I don’t think it’s unreasonable to distrust doom arguments for exactly this reason?
Yes, I’m saying it’s a reasonable conclusion to draw, and the fact that it isn’t drawn here is indicative of a kind of confirmation bias.
So, I definitely do think I’ve got some confirmation bias here. I know because the first thing I thought when I saw this was “man, this sure looks like the thing Eliezer was complaining about”, and it was a while later, thinking it through, that I was like “this does seem like it should make you really doomy about any agent-foundations-y plans, or other attempts to sidestep modern ML and cut towards ‘getting the hard problem right on the first try.’”
I did (later) think about that a bunch and integrate it into the post.
I don’t know whether I think it’s reasonable to say “it’s additionally confirmation-bias-indicative that the post doesn’t talk about general doom arguments.” As Eli says, the post is mostly observing a phenomenon that seems more about plan-making than general reasoning.
(fwiw my own p(doom) is more like ‘I dunno man, somewhere between 10% and 90%, and I’d need to see a lot of things going concretely right before my emotional center of mass shifted below 50%’)
Thanks for clarifying. I do agree with the broader point that one should have a sort of radical uncertainty about (e.g.) a post-AGI world. I’m not sure I agree it’s a big issue to leave that out of any given discussion, though, since it shifts probability mass from any particular describable outcome to the big “anything can happen” area.
(This might be what people mean by “Knightian uncertainty”?)