That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It’s guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly—such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That’s not what surviving worlds look like.
Something bugged me about this paragraph, until I realized: If you actually wanted to know whether or not this was true, you could have just asked Nate Soares, Paul Christiano, or anybody else you respected to write this post first, then removed all doubt by making a private comparison. If you had enough confidence in the community you could have even made it into a sequence; gather up all of the big alignment researchers’ intuitions on where the Filters are and then let us make our own opinion up on which was most salient.
Instead, now we’re in a situation where, I expect, if anybody writes something basically similar you will just posit that they can’t really do alignment research because they couldn’t have written it “from the null string” like you did. Doing this would literally have saved you work on expectation, and it seems obvious enough for me to be suspicious as to why you didn’t think of it.
I tried something like this much earlier with a single question, “Can you explain why it’d be hard to make an AGI that believed 222 + 222 = 555”, and got enough pushback from people who didn’t like the framing that I shelved the effort.
My attempt (thought about it for a minute or two):
Because arithmetic is useful, and the self-contradictory version of arithmetic, where 222+222=555 allows you to prove anything and is useless. Therefore, a smart AI that wants and can invent useful abstractions will invent its own (isomorphic to our arithmetic, in which 222+222=444) arithmetic from scratch and will use it for practical purposes, even if we can force it not to correct an obvious error.
I think this is the right answer. Just to expand on this a bit: The problem isn’t necessarily that 222+222=555 leads to a contradiction with the rest of arithmetic. One can imagine that instead of defining “+” using “x+Sy=y+Sx”, we could give it a much more complex definition where there is a special case carved out for certain values like 222. The issue is that the AI has no reason to use this version of “+” and will define some other operation that works just like actual addition. Even if we ban the AI from using “x+Sy=y+Sx” to define any operations, it will choose the nearest thing isomorphic to addition that we haven’t blocked, because addition is so common and useful. Or maybe it will use the built-in addition, but whenever it wants to add n+m, it instead adds 4n+4m, since our weird hack doesn’t affect the subgroup consisting of integers divisible by 4.
MIRI’s top researchers don’t understand, or can’t explain, why having incorrect maps makes it harder to navigate the territory and leads to more incorrect beliefs. Something I find very hard to believe even if you’re being totally forthright.
You asked some random people near you who don’t represent the top crust of alignment researchers, which is obviously irrelevant.
There’s some very subtle ambiguity to this that I’m completely unaware of.
You asked people in a way that heavily implied it was some sort of trick question and they should get more information, then assumed they were stupid because they asked followup questions.
This comment is written almost deliberately misleadingingly. You’re just explaining a random story about how you ran out of energy to ask Nate Soares to write a post.
I guarantee you that most reasonably intelligent people, if asked this question after reading the sequences in a way that they didn’t expect was designed to trip them up, would get it correctly. I simply do not believe that everyone around you is as stupid as you are implying, such that you should have shelved the effort.
Damn aight. Would you be willing to explain for the sake of my own curiosity? I don’t have the gears to understand why that wouldn’t be at least one reason.
If this is “kind of a test for capable people” i think it should be remained unanswered, so anyone else could try. My take would be: because if 222+222=555 then 446=223+223 = 222+222+1+1=555+1+1=557. With this trick “+” and “=” stops meaning anything, any number could be equal to any other number. If you truly believe in one such exeption, the whole arithmetic cease to exist because now you could get any result you want following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself
Ok, so here’s my take on the “222 + 222 = 555” question.
First, suppose you want your AI to not be durably wrong, so it should update on evidence. This is probably implemented by some process that notices surprises, goes back up the cognitive graph, and applies pressure to make it have gone the right way instead.
Now as it bops around the world, it will come across evidence about what happens when you add those numbers, and its general-purpose “don’t be durably wrong” machinery will come into play. You need to not just sternly tell it “222 + 222 = 555″ once, but have built machinery that will protect that belief from the update-on-evidence machinery, and which will also protect itself from the update-on-evidence machinery.
Second, suppose you want your AI to have the ability to discover general principles. This is probably implemented by some process that notices patterns / regularities in the environment, and builds some multi-level world model out of it, and then makes plans in that multi-level world model. Now you also have some sort of ‘consistency-check’ machinery, which scans thru the map looking for inconsistencies between levels, goes back up the cognitive graph, and applies pressure to make them consistent instead. [This pressure can both be ‘think different things’ and ‘seek out observations / run experiments.’]
Now as it bops around the world, it will come across more remote evidence that bears on this question. “How can 222 + 222 = 555, and 2 + 2 = 4?” it will ask itself plaintively. “How can 111 + 111 = 222, and 111 + 111 + 111 + 111 = 444, and 222 + 222 = 555?” it will ask itself with a growing sense of worry.
Third, what did you even want out of it believing that 222 + 222 = 555? Are you just hoping that it has some huge mental block and crashes whenever it tries to figure out arithmetic? Probably not (tho it seems like that’s what you’ll get), but now you might be getting into a situation where it is using the correct arithmetic in its mind but has constructed some weird translation between mental numbers and spoken numbers. “Humans are silly,” it thinks it itself, “and insist that if you ask this specific question, it’s a memorization game instead of an arithmetic game,” and satisfies its operator’s diagnostic questions and its internal sense of consistency. And then it goes on to implement plans as if 222 + 222 = 444, which is what you were hoping to avoid with that patch.
No one is going to believe me, but when I originally wrote that comment, my brain read something like “why would an AI that believed 222 + 222 = 555 have a hard time”. Only figured it out now after reading your reply.
Part one of this is what I would’ve come up with, though I’m not particularly certain it’s correct.
I guarantee you that most reasonably intelligent people, if asked this question after reading the sequences in a way that they didn’t expect was designed to trip them up, would get it correctly.
why having incorrect maps makes it harder to navigate the territory
I’d have guessed the disagreement wasn’t about whether “222 + 222 = 555” is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe ‘I don’t want to think about this because it seems irrelevant/disanalogous to alignment work’?)
And I’d have guessed the answer Eliezer was looking for was closer to ‘the OP’s entire Section B’ (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there’s nonzero difficulty? But I don’t have inside info about this experiment.
I’d have guessed the disagreement wasn’t about whether “222 + 222 = 555” is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe ‘I don’t want to think about this because it seems irrelevant/disanalogous to alignment work’?)
I’d have guessed that too, which is why I would have preferred him to say that they disagreed on |whatever meta question he’s actually talking about| instead of implying disagreement on |other thing that makes his disappointment look more reasonable|.
And I’d have guessed the answer Eliezer was looking for was closer to ‘the OP’s entire Section B’ (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there’s nonzero difficulty? But I don’t have inside info about this experiment.
That story sounds much more cogent, but it’s not the primary interpretation of “I asked them a single question” followed by the quoted question. Most people don’t go on 5 paragraph rants in response to single questions, and when they do they tend to ask clarifying details regardless of how well they understand the prompt, so they know they’re responding as intended.
I tried something like this much earlier with a single question, “Can you explain why it’d be hard to make an AGI that believed 222 + 222 = 555”, and got enough pushback from people who didn’t like the framing that I shelved the effort.
Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it
would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.
I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.
For the record, I found that line especially effective. I stopped, reread it, stopped again, had to think it through for a minute, and then found satisfaction with understanding.
If you had an AI that could coherently implement that rule, you would already be at least half a decade ahead of the rest of humanity.
You couldn’t encode “222 + 222 = 555” in GPT-3 because it doesn’t have a concept of arithmetic, and there’s no place in the code to bolt this together. If you’re really lucky and the AI is simple enough to be working with actual symbols, you could maybe set up a hack like “if input is 222 + 222, return 555, else run AI” but that’s just bypassing the AI.
Explaining “222 + 222 = 555” is a hard problem in and of itself, much less getting the AI to properly generalize to all desired variations (is “two hundred and twenty two plus two hundred and twenty two equals five hundred and fifty five” also desired behavior? If I Alice and Bob both have 222 apples, should the AI conclude that the set {Alice, Bob} contains 555 apples? Getting an AI that evolves a universal math module because it noticed all three of those are the same question would be a world-changing break through)
Something bugged me about this paragraph, until I realized: If you actually wanted to know whether or not this was true, you could have just asked Nate Soares, Paul Christiano, or anybody else you respected to write this post first, then removed all doubt by making a private comparison. If you had enough confidence in the community you could have even made it into a sequence; gather up all of the big alignment researchers’ intuitions on where the Filters are and then let us make our own opinion up on which was most salient.
Instead, now we’re in a situation where, I expect, if anybody writes something basically similar you will just posit that they can’t really do alignment research because they couldn’t have written it “from the null string” like you did. Doing this would literally have saved you work on expectation, and it seems obvious enough for me to be suspicious as to why you didn’t think of it.
I tried something like this much earlier with a single question, “Can you explain why it’d be hard to make an AGI that believed 222 + 222 = 555”, and got enough pushback from people who didn’t like the framing that I shelved the effort.
I am interested in what kind of pushback you got from people.
My attempt (thought about it for a minute or two):
Because arithmetic is useful, and the self-contradictory version of arithmetic, where 222+222=555 allows you to prove anything and is useless. Therefore, a smart AI that wants and can invent useful abstractions will invent its own (isomorphic to our arithmetic, in which 222+222=444) arithmetic from scratch and will use it for practical purposes, even if we can force it not to correct an obvious error.
I think this is the right answer. Just to expand on this a bit: The problem isn’t necessarily that 222+222=555 leads to a contradiction with the rest of arithmetic. One can imagine that instead of defining “+” using “x+Sy=y+Sx”, we could give it a much more complex definition where there is a special case carved out for certain values like 222. The issue is that the AI has no reason to use this version of “+” and will define some other operation that works just like actual addition. Even if we ban the AI from using “x+Sy=y+Sx” to define any operations, it will choose the nearest thing isomorphic to addition that we haven’t blocked, because addition is so common and useful. Or maybe it will use the built-in addition, but whenever it wants to add n+m, it instead adds 4n+4m, since our weird hack doesn’t affect the subgroup consisting of integers divisible by 4.
FWIW the framing seems exciting to me.
So, there are five possibilities here:
MIRI’s top researchers don’t understand, or can’t explain, why having incorrect maps makes it harder to navigate the territory and leads to more incorrect beliefs. Something I find very hard to believe even if you’re being totally forthright.
You asked some random people near you who don’t represent the top crust of alignment researchers, which is obviously irrelevant.
There’s some very subtle ambiguity to this that I’m completely unaware of.
You asked people in a way that heavily implied it was some sort of trick question and they should get more information, then assumed they were stupid because they asked followup questions.
This comment is written almost deliberately misleadingingly. You’re just explaining a random story about how you ran out of energy to ask Nate Soares to write a post.
I guarantee you that most reasonably intelligent people, if asked this question after reading the sequences in a way that they didn’t expect was designed to trip them up, would get it correctly. I simply do not believe that everyone around you is as stupid as you are implying, such that you should have shelved the effort.
EDIT: 😭
You didn’t get the answer correct yourself.
Damn aight. Would you be willing to explain for the sake of my own curiosity? I don’t have the gears to understand why that wouldn’t be at least one reason.
If this is “kind of a test for capable people” i think it should be remained unanswered, so anyone else could try. My take would be: because if 222+222=555 then 446=223+223 = 222+222+1+1=555+1+1=557. With this trick “+” and “=” stops meaning anything, any number could be equal to any other number. If you truly believe in one such exeption, the whole arithmetic cease to exist because now you could get any result you want following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself
This is what I meant by “leads to other incorrect beliefs”, so apparently not.
Ok, so here’s my take on the “222 + 222 = 555” question.
First, suppose you want your AI to not be durably wrong, so it should update on evidence. This is probably implemented by some process that notices surprises, goes back up the cognitive graph, and applies pressure to make it have gone the right way instead.
Now as it bops around the world, it will come across evidence about what happens when you add those numbers, and its general-purpose “don’t be durably wrong” machinery will come into play. You need to not just sternly tell it “222 + 222 = 555″ once, but have built machinery that will protect that belief from the update-on-evidence machinery, and which will also protect itself from the update-on-evidence machinery.
Second, suppose you want your AI to have the ability to discover general principles. This is probably implemented by some process that notices patterns / regularities in the environment, and builds some multi-level world model out of it, and then makes plans in that multi-level world model. Now you also have some sort of ‘consistency-check’ machinery, which scans thru the map looking for inconsistencies between levels, goes back up the cognitive graph, and applies pressure to make them consistent instead. [This pressure can both be ‘think different things’ and ‘seek out observations / run experiments.’]
Now as it bops around the world, it will come across more remote evidence that bears on this question. “How can 222 + 222 = 555, and 2 + 2 = 4?” it will ask itself plaintively. “How can 111 + 111 = 222, and 111 + 111 + 111 + 111 = 444, and 222 + 222 = 555?” it will ask itself with a growing sense of worry.
Third, what did you even want out of it believing that 222 + 222 = 555? Are you just hoping that it has some huge mental block and crashes whenever it tries to figure out arithmetic? Probably not (tho it seems like that’s what you’ll get), but now you might be getting into a situation where it is using the correct arithmetic in its mind but has constructed some weird translation between mental numbers and spoken numbers. “Humans are silly,” it thinks it itself, “and insist that if you ask this specific question, it’s a memorization game instead of an arithmetic game,” and satisfies its operator’s diagnostic questions and its internal sense of consistency. And then it goes on to implement plans as if 222 + 222 = 444, which is what you were hoping to avoid with that patch.
No one is going to believe me, but when I originally wrote that comment, my brain read something like “why would an AI that believed 222 + 222 = 555 have a hard time”. Only figured it out now after reading your reply.
Part one of this is what I would’ve come up with, though I’m not particularly certain it’s correct.
Sounds like the beginnings of a bet.
I will absolutely 100% do it in the spirit of good epistemics.
Edit: I’m glad Eliezer didn’t take me up on this lol
I’d have guessed the disagreement wasn’t about whether “222 + 222 = 555” is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe ‘I don’t want to think about this because it seems irrelevant/disanalogous to alignment work’?)
And I’d have guessed the answer Eliezer was looking for was closer to ‘the OP’s entire Section B’ (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there’s nonzero difficulty? But I don’t have inside info about this experiment.
I’d have guessed that too, which is why I would have preferred him to say that they disagreed on |whatever meta question he’s actually talking about| instead of implying disagreement on |other thing that makes his disappointment look more reasonable|.
That story sounds much more cogent, but it’s not the primary interpretation of “I asked them a single question” followed by the quoted question. Most people don’t go on 5 paragraph rants in response to single questions, and when they do they tend to ask clarifying details regardless of how well they understand the prompt, so they know they’re responding as intended.
Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.
I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.
For the record, I found that line especially effective. I stopped, reread it, stopped again, had to think it through for a minute, and then found satisfaction with understanding.
If you had an AI that could coherently implement that rule, you would already be at least half a decade ahead of the rest of humanity.
You couldn’t encode “222 + 222 = 555” in GPT-3 because it doesn’t have a concept of arithmetic, and there’s no place in the code to bolt this together. If you’re really lucky and the AI is simple enough to be working with actual symbols, you could maybe set up a hack like “if input is 222 + 222, return 555, else run AI” but that’s just bypassing the AI.
Explaining “222 + 222 = 555” is a hard problem in and of itself, much less getting the AI to properly generalize to all desired variations (is “two hundred and twenty two plus two hundred and twenty two equals five hundred and fifty five” also desired behavior? If I Alice and Bob both have 222 apples, should the AI conclude that the set {Alice, Bob} contains 555 apples? Getting an AI that evolves a universal math module because it noticed all three of those are the same question would be a world-changing break through)
FvC5IXzxQC+I3vstFGIUWlbtTFgRsa8bt0mKPN3K0UNZBkI7OLDBjjapp1+CoJPRYEqRM015PSZXUuh4OWwJEUBOTeLHeheLteG9LxGiuS6YqnV/PN0s0S/TyYjCPrF0vDHFDBy3IHW4qDQguf5QAA==