My thoughts: we can’t really expect to prove something like “this AI will be beneficial”. However, relying on empiricism to test our algorithms is very likely to fail, because it’s very plausible that there’s a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don’t know how to make good guesses about the behavior of very capable systems except through mathematical analysis.
There are two overlapping traditions in machine learning. There’s a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there’s machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.
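To make the second tradition concrete, here is one standard example of the kind of guarantee it produces (a textbook PAC-style generalization bound for a finite hypothesis class; nothing safety-specific, just an illustration of what “properties are proven” looks like). With $m$ i.i.d. training samples and confidence parameter $\delta$, with probability at least $1-\delta$,

$$\bigl|\,\mathrm{err}(h) - \widehat{\mathrm{err}}(h)\,\bigr| \;\le\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}} \quad \text{for every } h \in \mathcal{H}.$$

Bounds of this form hold for any data distribution, which is exactly the kind of statement that benchmarking on particular test sets cannot give you.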
(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)
I don’t think you need to posit a discontinuity to expect tests to occasionally fail.
I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.
I’ll admit I don’t feel like I really understand the perspective of people who seem to think we’ll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:
We’ll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.
counter-argument: but the concern is about what happens at deployment time
We’ll deploy the AI in a box too, then.
counter: seems like that entails a massive performance hit (but it’s not clear if that’s actually the case)
We’ll have other “AI police” to stop any “evil AIs” that “go rogue” (just like we have for people).
counter: where did the AI police come from, and why can’t they go rogue as well?
The “AI police” can just be the rest of the AIs in the world ganging up on anyone who goes rogue.
counter: this seems to be assuming the “corrigibility as basin of attraction” argument (which has no real basis beyond intuition ATM, AFAIK) at the level of the population of agents.
A single failure isn’t likely to be that bad; it would take a series of unlikely failures to take a safe (e.g. “satiable”) AI and make it an insatiable “open-ended optimizer AI”.
counter: we can’t assume that we can detect and correct failures, especially in real-world deployment scenarios where subagents might be created, so the failures may have time to compound. It also seems possible that a single failure is all that’s needed; this seems like an open question.
OK I could go on, but I’d rather actually hear from anyone who has this view! :)
I hold this view; none of those are reasons for my view. The reason is much simpler: before x-risk-level failures, we’ll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We’ll notice this, understand it, and fix the issue.
(A crux I expect people to have is whether we’ll actually fix the issue or “apply a bandaid” that is only a superficial fix.)
Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don’t see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.
If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.
Can you elaborate on what kinds of problems you expect to arise pre- vs. post-discontinuity?
E.g. will we see “sinister stumbles” (IIRC this was Adam Gleave’s name for half-baked treacherous turns)? I think we will, FWIW.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn’t emphasize the “optimization” part.)
Jessica’s posts about MIRI vs. Paul’s views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I’d expect, unless ML becomes “woke” to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we’d see something that *looks* like a discontinuity, but is *actually* more like “the same reason”.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
This in particular doesn’t match my model. Quoting some relevant bits from Embedded Agency:
So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
[...]
But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.
This is also the topic of The Rocket Alignment Problem.

Interesting. Your crux seems good; I think it’s a crux for us. I expect things play out more like Eliezer predicts here: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D&hc_location=ufi

I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc… I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world.
A very similar problem would be a form of longer-term “seeding”, where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances (“at the margin”) that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.
I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can’t carry on a conversation, but can implement a very sophisticated covert world domination strategy.
It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc…
Now I’m wondering if it makes sense to model past or present cognitive-cultural information processes in a similar fashion. Memetic and cultural evolution are a thing, and any agent-like processes that spawn could piggyback on our existing general intelligence architecture.
Yeah, I think it totally does! (and that’s a very interesting / “trippy” line of thought :D)
However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don’t think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for “savant-like” intelligence, which is sort of what I’m imagining here. I can’t think of why I have that intuition OTTMH.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.
Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many “easy” things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)
I also predict that there will be types of failure we will not notice, or will misinterpret. [...]
All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I’d be more worried about scenarios like these.
So I don’t take EY’s post as being about AI researchers’ competence so much as about their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.
I don’t think I’m underestimating AI researchers, either, but for a different reason… let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I’m imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.
Regarding long-term planning, I’d factor this into 2 components:
1) having a good planning algorithm
2) having a good world model
I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.
I don’t think it’s necessary to have a very “complete” world-model (i.e. enough knowledge to look smart to a person) in order to find “steganographic” long-term strategies like the ones I’m imagining.
I also don’t think it’s even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs… (i.e. be some sort of savant).
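For concreteness, here is a minimal sketch of what a single GVF is in the reinforcement-learning sense (a prediction of a discounted “cumulant” signal under some policy, learned by TD), on a made-up toy chain environment. Every detail here (the environment, the cumulant, the constants) is an arbitrary assumption for illustration, not a claim about how an actual advanced system would be built:

```python
import numpy as np

# Toy illustration of a single GVF: on a 5-state random-walk chain, predict the
# expected discounted number of future visits to state 3 under a fixed
# uniform-random policy.  Cumulant = 1 when entering state 3, else 0.
# (Environment, cumulant, and constants are all made up for illustration.)

n_states, gamma, alpha = 5, 0.9, 0.1
rng = np.random.default_rng(0)
gvf = np.zeros(n_states)  # tabular GVF estimate, one prediction per state

def cumulant(state):
    return 1.0 if state == 3 else 0.0

for _ in range(5000):            # episodes
    s = 2                        # start in the middle of the chain
    while True:
        s_next = s + rng.choice([-1, 1])          # random-walk "policy"
        done = s_next < 0 or s_next >= n_states   # walked off either end
        c = 0.0 if done else cumulant(s_next)
        target = c + (0.0 if done else gamma * gvf[s_next])
        gvf[s] += alpha * (target - gvf[s])       # TD(0) update
        if done:
            break
        s = s_next

print(np.round(gvf, 2))  # predicted discounted visits to state 3, per state
```

The relevant point is that an agent could carry many accurate predictions of this kind about strategically important quantities without having anything we would recognize as a legible, complete world-model.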
I don’t think the only alternative to proof is empiricism. Lots of people reason about evolutionary biology/psychology with neither proof nor empiricism. The mesa-optimizers paper involves neither proof nor empiricism.
it’s very plausible that there’s a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems)
You can also be empirical at that point though? I suppose you couldn’t be empirical if you expect either an extremely fast takeoff (i.e. on the order of one day or less) or an inability on our part to tell when the AI reaches human level, but this seems overly pessimistic to me.
The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, has two important properties:
They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
They primarily point at problems rather than solutions. Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for “significant risk” vs “significant good”. I.e., an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to “not push the button” (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong. I want there to be a small set of assumptions, each of which individually seems very likely to be true, which taken together would be a guarantee against catastrophic failure. (A toy version of this asymmetry is worked out just below.)
[This is an “or” condition—either one of those two conditions suffices for me to take vague arguments seriously.]
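As a toy illustration of those asymmetric standards of evidence (the two-outcome framing and the numbers are assumptions purely for illustration): suppose pushing the button yields utility $+U$ if things go well (probability $p$) and $-D$ if they go catastrophically wrong (probability $1-p$). Pushing is net positive in expectation only when

$$pU - (1-p)D \ge 0 \quad\Longleftrightarrow\quad p \ge \frac{D}{U+D}.$$

If the downside is even a thousand times larger than the upside ($D = 1000\,U$), this already requires $p \ge 1000/1001 \approx 0.999$. A rough argument can establish “there may be a significant risk”, but far stronger evidence is needed to justify confidence at that level.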
On the other hand, I agree with you that I set up a false dichotomy between proof and empiricism. Perhaps a better model would be a spectrum between “theory” and empiricism. Mathematical arguments are an extreme point of rigorous theory. Empiricism realistically comes with some amount of theory no matter what. And you could also ask for a “more of both” type approach, implying a 2d picture where they occupy separate dimensions.
Still, though, I personally don’t see much of a way to gain understanding about failure modes of very very capable systems using empirical observation of today’s systems. I especially don’t see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.
Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for “significant risk” vs “significant good”.
This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I’d want to think more about it (I haven’t because it doesn’t seem decision-relevant).
I especially don’t see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.
The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what’ll happen.)
I am confused about how the normative question isn’t decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn’t? To be hopefully clear: I’m applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn’t mean direct proof of the claim “the AI will do good”, but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.
It’s possible that this isn’t my true disagreement, because actually the question seems more complicated than just a question of how large potential downsides are if things go poorly in comparison to potential upsides if things go well. But some kind of analysis of the risks seems relevant here—if there weren’t such large downside risks, I would have lower standards of evidence for claims that things will go well.
The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what’ll happen.)
It sounds like we would have to have a longer discussion to resolve this. I don’t expect this to hit the mark very well, but here’s my reply to what I understand:
I don’t see how you can be confident enough of that view for it to be how you really want to check.
A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out “hacks” around the “usual interpretation” of the proxy. (A toy numerical sketch of this dynamic appears at the end of this reply.)
I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren’t the only two things in the universe). I’m not sure which of those two disagreements is more important here.
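To make the proxy point above a bit more tangible, here is a toy numerical sketch (every modeling choice is arbitrary and for illustration only, not a model of any real system): the true value of an option is drawn from a normal distribution, the proxy is the true value plus an error term, and “more capable” is crudely modeled as being able to search over more candidate options. With light-tailed error, selecting the proxy-best option still tends to improve true value as the search widens; with heavy-tailed error, the proxy-best option is increasingly one where the error, not the true value, is what’s extreme.

```python
import numpy as np

# Toy Goodhart sketch: proxy = true value + error.  "Capability" is modeled
# crudely as the number of candidate options n the optimizer can search over.
# We compare light-tailed error (Normal) with heavy-tailed error (Student-t, df=2).
# All distributions and constants are arbitrary choices for illustration.

rng = np.random.default_rng(0)

def true_value_of_proxy_best(n, error_sampler, trials=1000):
    """Average true value of the option that scores highest on the proxy."""
    picked = []
    for _ in range(trials):
        true_vals = rng.normal(size=n)         # true values of n options
        proxy = true_vals + error_sampler(n)   # proxy score per option
        picked.append(true_vals[np.argmax(proxy)])
    return float(np.mean(picked))

for n in (10, 100, 1000, 10000):
    light = true_value_of_proxy_best(n, lambda k: rng.normal(size=k))
    heavy = true_value_of_proxy_best(n, lambda k: rng.standard_t(df=2, size=k))
    print(f"n={n:>6}  true value of proxy-best option: "
          f"light-tailed error={light:.2f}, heavy-tailed error={heavy:.2f}")
```

This is only a cartoon of “a fairly good proxy falling apart under stronger optimization”, but it gestures at why alignment observed at low capability need not persist at high capability.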
To be hopefully clear: I’m applying this normative claim to argue that proof is needed to establish the desired level of confidence.
Under my model, it’s overwhelmingly likely that, regardless of what we do, AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I’d be extremely interested in the normative claim. If I then agreed with the normative claim, I’d agree with:
proof is needed to establish the desired level of confidence. That doesn’t mean direct proof of the claim “the AI will do good”, but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.
I don’t see how you can be confident enough of that view for it to be how you really want to check.
If I want >99% confidence, I agree that I couldn’t be confident enough in that argument.
A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out “hacks” around the “usual interpretation” of the proxy.
Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate, but this led to “hacks” around the “usual interpretation”), and have some good reason to think that it won’t happen with the highly capable system they are planning to deploy.
I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be
Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.