I really wish this post took a different rhetorical tack. Claims like, for example, the one that the reader should engage with your argument because “it has been certified as valid by professional computer scientists” do the post a real disservice. And they definitely made me disinclined to continue reading.
Not trying to be arrogant. Just trying to present readers who have limited time with a quickly digestible bit of evidence about the likelihood that the argument is a shambles.
It didn’t strike me as arrogant. It struck me as misleading in a way that made me doubt the quality of the enclosed argument.
Quick question, but why do you have that reaction?
Peer review is not a certification of validity, even in more rigorous venues. Not even close.
I am used to seeing questionable claims forwarded under headlines like “new published study says XYZ”.
That XYZ was peer reviewed is one of the weaker arguments one could make in its favor, so when someone uses that as a selling point, it indicates to me that there aren’t better reasons to believe in XYZ. (Analogously, when I see an ML paper boast that their new method is “competitive with” the SOTA, I immediately think “That means they tried to beat the SOTA, but found their method was at least a little worse. If it was better, they would’ve said so.”)
Do you think the peer reviewers and the editors thought the argument was valid?
Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.
No it doesn’t. It’s hard to say what the “aims” of peer-review are, but “ensuring validity” is certainly not one of them. As a first approximation, I’d say that peer-review aims to certify that the author is not an obvious crank, and that the argument being made is an interesting one to someone in the field.
Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.
“As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting.”
Strongly agree - … - Strongly Disagree
“As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren’t fully supported by the analysis.”
Strongly agree - … - Strongly Disagree
No, no more than I would bet on a survey of <insert religious group here> whether they think <religious group> is more virtuous than <non-religious group>. Academics may claim that peer review is meant to check validity, but their actions tell a different story. This is especially true in “hard” fields like mathematics, where reviewers may even struggle to follow an argument, let alone check its validity. Given that most papers are never read by others, this is really not a big deal though.
But I’ll offer three further arguments for why I don’t think peer review ensures validity.
Argument 1: a) Humans (including reviewers) make mistakes all the time, but b) Retractions/corrections in papers are very rare.
Unless academics are better at spotting mistakes immediately when reviewing than everyone else (they are not), we should therefore expect lots of peer-reviewed articles to contain mistakes, because invalid papers rarely get retracted.
Argument 2: Computer science papers don’t always include reproducible software, but checking code would absolutely be required to check validity.
Argument 3: It is customary to submit papers that are rejected by one journal to another journal. This means that articles that fail “peer review” at one journal can obtain “peer review” at a different journal.
PS: For CS it’s harder to check “validity”, but here’s how papers replicate in other fields: https://fantasticanachronism.com/2021/11/18/how-i-made-10k-predicting-which-papers-will-replicate/
Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.
You: No it doesn’t. They just care about interestingness.
Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?
You: Yes, but...
If you can admit that we agree on this basic point, I’m happy to discuss further about how good they are at what they aim to do.
1: If retractions were common, surely you would have said that was evidence peer review didn’t accomplish much! If academics were only equally good at spotting mistakes immediately, they would still spot the most mistakes because they get the first opportunity to. And if they do, others don’t get a “chance” to point out a flaw and have the paper retracted. Even though this argument fails, I agree that journals are too reluctant to publish retractions; pride can sometimes get in the way of good science. But that has no bearing on their concern for validity at the reviewing stage.
2: Some amount of trust is taken for granted in science. The existence of trust in a scientific field does not imply that the participants don’t actually care about the truth. Bounded Distrust.
3: Since some level of interestingness is also required for publication, this is consistent with a top venue having a higher bar for interestingness than a lesser venue, even while they share the same requirement for validity. And this is in fact the main effect at play. But yes, there are also some lesser journals/conferences/workshops where they are worse at checking validity, or they care less about it because they are struggling to publish enough articles to justify their existence, or because they are outright scams. So it is relevant that AAAI publishes AI Magazine, and their brand is behind it. I said “peer reviewed” instead of “peer reviewed at a top venue” because the latter would have rubbed you the wrong way even more, but I’m only claiming that passing peer review is worth a lot at a top venue.
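A toy simulation (my own construction, with made-up parameters) of the selection effect described in point 1: reviewers get the first opportunity to catch a flaw, a flaw caught in review never produces a retraction, and journals rarely retract even flaws found later. It is only meant to show that a very low retraction rate is compatible with reviewers who do check for validity, while many published papers still contain mistakes.

```python
import random

random.seed(0)

# All parameters are illustrative assumptions, not estimates from this thread.
N = 100_000                # submitted papers
P_ERROR = 0.30             # paper contains an invalidating mistake
P_REVIEWER_CATCH = 0.50    # a reviewer notices the mistake before publication
P_READER_CATCH = 0.50      # some later reader notices it after publication
P_RETRACT_IF_FOUND = 0.10  # the journal actually retracts once the flaw is known

published = published_with_error = retracted = 0
for _ in range(N):
    has_error = random.random() < P_ERROR
    if has_error and random.random() < P_REVIEWER_CATCH:
        # Caught in review: the paper is fixed or rejected, so it never
        # appears in the literature and never needs retracting.
        continue
    published += 1
    if has_error:
        published_with_error += 1
        if random.random() < P_READER_CATCH and random.random() < P_RETRACT_IF_FOUND:
            retracted += 1

print(f"published papers still containing a mistake: {published_with_error / published:.1%}")
print(f"retraction rate among published papers:      {retracted / published:.1%}")
```

With these illustrative numbers, roughly one in six published papers still carries a mistake while under 1% are ever retracted, which is consistent with both sides of the exchange above.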
Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?
I’ve reviewed papers. I didn’t spend copious amounts of time checking the proofs. Some/most reviewers may claim to only accept “valid papers” (whatever that means), but the way the system is set up, peer review serves mainly to filter out blatantly bad papers. Sure, people try to catch the obviously invalid papers. And sure, many researchers really try to find mistakes. But at the end of the day, you can always get your results published somewhere, and once something is published, it is almost never retracted.
If retractions were common, surely you would have said that was evidence peer review didn’t accomplish much!
Sure, let me retract my previous argument and amend it with the additional statement that even when a paper is known to have mistakes by the community, it is almost never retracted.
2: Some amount of trust is taken for granted in science. The existence of trust in a scientific field does not imply that the participants don’t actually care about the truth. Bounded Distrust.
I don’t think that this refutes my argument like you think it does. Reviewers don’t check software because they don’t have the capacity to check software. It is well-known that all non-trivial software contains bugs. Reviewers accept this, because at the end of the day they don’t comprehensively check validity.
because the latter would have rubbed you the wrong way even more
No, I think that peer review at a good journal is worth much more than peer review at a bad journal.
I think our disagreement comes down to the stated intent being to check validity, and me arguing that the actual effect is to offer a filter for poorly written or uninteresting articles. There is obviously some overlap, as nobody will find an obviously invalid article interesting! Depending on the journal, this may come close to checking some kind of validity. I trust an article in Annals of Mathematics to be correct in a way that I don’t trust an article in PNAS to be. We can compare peer review with the FDA—the stated intent is to offer safe medications to the population. The actual effect is …
Do you think the peer reviewers and the editors thought the argument was valid?
Maybe? I really don’t feel comfortable speculating like that about their thinking. What exactly peer review entails & what exactly reviewers expect varies a lot based on the field and journal/conference.
Hey, wanted to chip into the comments here because they are disappointingly negative.
I think your paper and this post are extremely good work. They won’t push forward the all-things-considered viewpoint, but they surely push forward the lower bound (or adversarial) viewpoint. Also, because Open Phil and Future Fund use some fraction of lower-end risk in their estimate, this should hopefully wipe that out. Together they much more rigorously lay out classic x-risk arguments.
I think that getting the prior work peer reviewed is also a massive win at least in a social sense. While it isn’t much of a signal here on LW, it is in the wider world. I have very high confidence that I will be referring to that paper in arguments I have in the future, any time the other participant doesn’t give me the benefit of the doubt.
Thank you very much for saying that.
I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it’s hard to complain when I’m on the receiving end of that dynamic.
Object-level comments below.
Clearing up some likely misunderstandings:
Assumption 1. A sufficiently advanced agent will do at least human-level hypothesis generation regarding the dynamics of the unknown environment.
I am fairly confident that this is not the part TurnTrout/Quintin were disagreeing with you on. Such an agent plausibly will be doing at least human-level hypothesis generation. The question is on what goals will be driving the agent. A monk may be able to generate the hypothesis that narcotics would feel intensely rewarding, more rewarding than any meditation they have yet experienced, and that if they took those narcotics, their goals would shift towards them. And yet, even after generating that hypothesis, that monk may still choose not to conduct that intervention because they know that it would redirect them towards proximal reward-related chemical-goals and away from distal reward-related experiential-goals (seeing others smile, for ex.).
Also, I am not even sure there is actually a disagreement on whether agents will intervene on the reward-generating process. Quote from Reward is not the optimization target:
Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren’t updated in a way it wouldn’t endorse. Though that’s an example of convergent powerseeking, not reward seeking.”
That is, the agent will probably want to intervene on the process that is shaping its goals. In fact, establishing control over the process that updates its cognition is instrumentally convergent, no matter what goal it is pursuing.
In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward, instead doing that deliberate optimization for instrumental reasons (they like video games, they are competitive, they have a weird obsession with virtual points, etc.). This is what I believe Quintin meant by “The same way humans do it?”
To understand why they believe that matters at all for understanding the behavior of a reinforcement learner (as opposed to a human), we can look to another blog post of theirs.
Let’s look at the assumptions they make. They basically assume that the human brain only does reinforcement learning. (Their Assumption 3 says the brain does reinforcement learning, and Assumption 1 says that this brain-as-reinforcement-learner is randomly initialized, so there is no other path for goals to come in.) [...] In this blog post, the words “innate” and “instinct” never appear.
Whoa whoa whoa. This is definitely a misunderstanding. Assumption 2 is all about how the brain does self-supervised learning in addition to “pure reinforcement learning”. Moreover, if you look at the shard theory post, it talks several times about how the genome indirectly shapes the brain’s goal structure, whenever the post mentions “hard[-]coded reward circuits”. It even says so right in the bit that introduces Assumption 3!
Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain).
Those “hard-coded” reward circuits are what you would probably instead call “innate” and form the basis for some subset of the “instincts” relevant to this discussion. Perhaps you were searching using different words, and got the wrong impression because of it? This one seems like a pretty clear miscommunication.
Incidentally, I am also confused about how you reach your published conclusion, the one ending in “with catastrophic consequences”, from your 6 assumptions alone. The portion of it that I follow is that advanced agents may intervene in the provision of rewards, but I don’t see how much else follows without further assumptions...
The assumption says “will do” not “will be able to do”. And the dynamics of the unknown environment include the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why I engaged with the objection that reward is not the optimization target under this section.
There is no need to recruit the concept of “terminal” here for following the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success, but it does all this because of some “terminal” reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.
If I want to analyze what would probably happen if Edward Snowden tried to enter the White House, there’s lots I can say without needing to understand what deep reason he had for trying to do this. I can just look at the implications of his attempt to enter the White House: he’d probably get caught and go to jail for a long time. Likewise, if an RL agent is trying to maximize its reward, there’s plenty of analysis we can do that is independent of whether there’s some other terminal reason for this.
That’s fine. Let’s say the agent “will do at least human-level hypothesis generation regarding the dynamics of the unknown environment”. That still does not imply that reward is their optimization target. The monk in my analogy in fact does such hypothesis generation, deliberately modeling the origin of reward, and yet reward is not what they are seeking.
The reason we are talking about terminal vs. instrumental stuff is that your claims seem to be about how agents will learn what their goals are from the environment. But if advanced agents can use their observations & interactions with the environment to learn information about instrumental means (i.e. how do my actions create expected outcomes, how do I build a better world model, etc.) while holding their terminal goals (i.e. which outcomes I intrinsically care about) fixed, then that should change our conclusions about how advanced agents will behave, because competent agents reshape their instrumental behavior to subserve their terminal values, rather than inferring their terminal values from the environment or what have you. This is what “goal-content integrity”—a dimension of instrumental convergence—is all about.
The video game player doesn’t want high reward that comes from cheating. It is not behaviourally identical to a reward maximiser unless you take the reward to be the quantity “what I would’ve received if I hadn’t cheated”.
The title suggests (weakly perhaps) that the estimates themselves were peer-reviewed. It would be clearer to write “building on a peer-reviewed argument”, or similar.
Thank you. I’ve changed the title.
At the moment, I am particularly interested in the structure of proposed x-risk models (more than the specific conclusions). Lately, there has been a lot of attention on Carlsmith-style decompositions, which have the form “Catastrophe occurs if this conjunction of events occur”. I found it interesting that this post took the upside-down version of that, i.e., “Catastrophe is inevitable unless (one of) these things happen”.
Why do I find this distinction relevant? Consider how uninformed our assessments for most of these factors in these models actually are. Once you include this meta-uncertainty, Jensen’s inequality implies that the mean & median risk of catastrophe decrease with greater meta-uncertainty, whereas when you turn the argument upside down, it also inverts the effect of Jensen’s inequality, such that the mean & median risk of catastrophe would increase with greater meta-uncertainty.
This is a complex topic that has more nuances than are warranted in a comment. I’ll mention that Michael’s actual argument uses “unless **one of** these things happen”, which is additive (thus not subject to the Jensen’s inequality phenomenon). But because the model is structured in this polarity, Jensen’s would kick in as soon as he introduces a conjunction of factors, which I see as something that would naturally occur with this style of model structure. Even though this post’s model is additive, I give this article credit for triggering this insight for me.
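To make the polarity point concrete, here is a rough Monte Carlo sketch (mine, not from the post or the comments). It treats each factor probability as itself uncertain (independent Beta draws with mean 0.5, an assumption), and compares a conjunctive decomposition, where catastrophe requires all factors, with the inverted one, where catastrophe is avoided only if a conjunction of “outs” comes through. Under the independence assumption used here it is mainly the medians that move apart as meta-uncertainty grows; how the means move depends on how the factor uncertainties are correlated.

```python
import random
import statistics

random.seed(0)
K = 5             # number of factors in the decomposition (assumed)
SAMPLES = 50_000  # Monte Carlo draws per setting

def factor(spread: float) -> float:
    """One uncertain factor probability with mean 0.5; larger spread = more meta-uncertainty."""
    a = b = 1.0 / spread
    return random.betavariate(a, b)

for spread in (0.05, 0.5, 2.0):
    conjunctive = []  # P(doom) when catastrophe requires ALL K factors: prod(p_i)
    inverted = []     # P(doom) when catastrophe is avoided only if ALL K "outs" hold: 1 - prod(q_i)
    for _ in range(SAMPLES):
        prod = 1.0
        for _ in range(K):
            prod *= factor(spread)
        conjunctive.append(prod)
        inverted.append(1.0 - prod)
    print(f"spread={spread}: "
          f"conjunctive mean={statistics.mean(conjunctive):.3f}, median={statistics.median(conjunctive):.3f} | "
          f"inverted mean={statistics.mean(inverted):.3f}, median={statistics.median(inverted):.3f}")
```

With independent factors and mean-0.5 draws, the conjunctive mean stays near 0.5^5 by linearity of expectation while its median falls as meta-uncertainty grows, and the inverted median rises correspondingly; correlated uncertainties would behave differently.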
Also, thank you for your Appendices A & B that describe your opinions about which approaches to alignment you see as promising and non-promising.
Great comment, this clarified the distinction of these arguments to me. And IMO this (Michael’s) argument is obviously the correct way to look at it.
Thanks for the post and especially for the peer-reviewed paper! Without disregarding the great non-peer-reviewed work that many others are doing, I do think it is really important to get the most important points peer-reviewed as well, preferably as explicit as possible (e.g. also mentioning human extinction, timelines, lower bound estimates, etc). Thanks as well for spelling out your lower bound probabilities, I think we should have this discussion more often, more structurally, and more widely (also with people outside of the AI xrisk community). I guess I’m also in the same ballpark regarding the options and numbers (perhaps a bit more optimistic).
Quick question:
“3.1.1. Practical laws exist which would, if followed, preclude dangerous AI. 100% (recall this is optimistically-biased, but I do tentatively think this is likely, having drafted such a law).”
Can you share (a link to) the law you drafted?
Very interesting argument. I’d be interested in using a spreadsheet where I can fill out my own probabilities for each claim, as it’s not immediately clear to me how you’ve combined them here.
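In case it helps, here is a minimal fill-in-your-own-numbers sketch of one way such credences could combine. It is my guess at the structure, not the post’s actual spreadsheet: sub-claims within an escape route combine as a conjunction, routes combine additively via a union bound (matching the “unless one of these things happen” framing discussed above), and every route name and probability below is a placeholder to be replaced with your own credences.

```python
# Placeholder escape routes; each maps to the probabilities of the sub-claims
# that must ALL hold for that route to avert catastrophe. None of these names
# or numbers come from the post.
routes = {
    "laws_preclude_dangerous_ai": [1.00, 0.90, 0.50],
    "alignment_solved_in_time": [0.20, 0.50],
    "voluntary_restraint_by_labs": [0.30],
}

def p_route(subclaims):
    """Probability the route works: conjunction of its sub-claims (independence assumed)."""
    p = 1.0
    for q in subclaims:
        p *= q
    return p

# Union bound: P(at least one route works) <= sum of the route probabilities,
# so 1 minus that sum is a lower bound on the probability of catastrophe.
p_any_route = min(1.0, sum(p_route(s) for s in routes.values()))
p_doom_lower_bound = max(0.0, 1.0 - p_any_route)

for name, subclaims in routes.items():
    print(f"{name}: {p_route(subclaims):.2f}")
print(f"Lower bound on P(catastrophe) under these inputs: {p_doom_lower_bound:.2f}")
```

Swapping in your own numbers for each sub-claim gives the kind of “spreadsheet” exercise you describe; whether this matches the post’s actual combination rule is a separate question.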
From section 3.1.2:
M. There’s nowhere that Jurgen Schmidhuber (currently in Saudi Arabia!) wants to move where he’s allowed to work on dangerously advanced AI, or he retires before he can make it. 50%
These credences feel borderline contradictory to me. M implies you believe that, conditional on no laws being passed which would make it illegal in any place he’d consider moving to, Jurgen Schmidhuber in particular has a >50% chance of building dangerously advanced AI within 20 years or so. Since you also believe the EU has a 90% chance of passing such a law before the creation of dangerously advanced AI, this implies you believe the EU has a >80% chance of outlawing the creation of dangerously advanced AI within 20 years or so. In fact, if we assume a uniform distribution over when JS builds dangerously advanced AI (such that it’s cumulatively 50% 20 years from now), that requires us to be nearly certain the EU would pass such a law within 10 years if we make it that long before JS succeeds. From where does such high confidence stem?
(Meta: I’m also not convinced it’s generally a good policy to be “naming names” of AGI researchers who are relatively unconcerned about the risks in serious discussions about AGI x-risk, since this could provoke a defensive response, “doubling down”, etc.)
I don’t understand. Importantly, these are optimistically biased, and you can’t assume my true credences are this high. I assign much less than 90% probability to C. But still, they’re perfectly consistent. M doesn’t say anything about succeeding—only being allowed. M is basically asking: of the places he’d be willing to live, do they all pass laws which would make building dangerously advanced AI illegal? The only logical connection between C and M is that M (almost definitely) implies C.
I think you make a lot of good points, but I do have a big objection:
the set of successful policies only includes ones that entertain different models of the origin of reward, and then pick actions to maximize predicted future rewards.
This is obviously an additional assumption, and not a compelling one. The relevant sense of success is “doing the right thing, whatever that means”, and reward maximisers are apparently not very good at this.
You argue that someone hasn’t yet given you a compelling alternative—but that’s a very weak reason to say that doing better is impossible. As we both agree, the set of all policies is very large.
A more plausible assumption is that people don’t manage to do better than reward maximisers, even though they’re not very good. I have doubts about this too, but I think it can at least be argued.
I don’t think it’s an assumption really. I think this sentence just fixes the meanings, in perfectly sensible ways, of the words “entertain” and “to” (as in “pick actions to”). I guess you’re not persuaded that competent behavior in the “many new video games” environment is deserving of the description “aiming to maximize predicted future rewards”. Why is that, if the video games are sufficiently varied?
Fixing the meanings of keywords is one of the important things that assumptions do, and the scope of the quoted claim is larger than you suggest.
You argue a sufficiently competent reward maximiser will intervene in the provision of rewards and will not in fact achieve high scores in many video games. An insufficiently competent reward maximiser will be outperformed by a more competent “intrinsically motivated video game player” (I think we can understand what this means well enough, even if we can’t give a formal specification). So I don’t think reward maximisers are the best video game players, which implies that competent video game playing is not equivalent to reward maximisation.
If you allow maximisers of “latent rewards”, then other parts of your argument become much less plausible. What does intervening on a latent reward even mean? Is it problematic?
I also think that video games are a special enough category that “video game competence” is probably a poor proxy for AI desirability.
I agree that it’s a poor proxy for AI desirability. To add to that problem, a game developer in one country can have values and goals opposite to those of game developers in a different country/region.
When you said “poor proxy for AI desirability”, the first thing that came to mind was the dog faces in DeepMind’s early illustrations, because the dataset contained too many dogs. The same kind of false “optimal” policy can be obtained from video games.
There can only be One X-risk. I wonder if anyone here believes that our first X-risk is not coming from AGI, that AGI is our fastest counter to our first X-risk.
In that case you are resisting what you should embrace. And I don’t blame you for that. Our age is driven by entertainment. As Elon said on Twitter, “The most entertaining outcome is the most likely”. The next breakthrough in AGI will only happen when someone brings out an “entertaining” enough X-risk scenario for you to feast on. At that point all AI safety protocols will be reworked, and you will open your mind to a new, out-of-the-box picture of the situation you are in now.