So, there are five possibilities here:
MIRI's top researchers don't understand, or can't explain, why having incorrect maps makes it harder to navigate the territory and leads to more incorrect beliefs. Something I find very hard to believe even if you're being totally forthright.
You asked some random people near you who don't represent the top crust of alignment researchers, which is obviously irrelevant.
There's some very subtle ambiguity to this that I'm completely unaware of.
You asked people in a way that heavily implied it was some sort of trick question and they should get more information, then assumed they were stupid because they asked follow-up questions.
This comment is written almost deliberately misleadingly. You're just explaining a random story about how you ran out of energy to ask Nate Soares to write a post.
I guarantee you that most reasonably intelligent people, if asked this question after reading the Sequences in a way that they didn't expect was designed to trip them up, would get it correct. I simply do not believe that everyone around you is as stupid as you are implying, such that you should have shelved the effort.
EDIT:
You didn't get the answer correct yourself.
Damn aight. Would you be willing to explain for the sake of my own curiosity? I don't have the gears to understand why that wouldn't be at least one reason.
If this is "kind of a test for capable people" I think it should remain unanswered, so anyone else can try. My take would be: because if 222+222=555, then 446 = 223+223 = 222+222+1+1 = 555+1+1 = 557. With this trick "+" and "=" stop meaning anything; any number could be equal to any other number. If you truly believe in even one such exception, the whole of arithmetic ceases to exist, because now you can get any result you want by following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself.
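The contradiction described above can be checked mechanically. A toy illustration of mine (not from the thread): grant the single exception while keeping ordinary algebra, and a nearby sum gets two different answers.

```python
# Toy check: accept "222 + 222 = 555" while keeping ordinary algebra,
# and the nearby sum 223 + 223 comes out two different ways.
patched_222_plus_222 = 555          # the belief we are told to hold
true_sum = 223 + 223                # 446, by ordinary arithmetic

# 223 + 223 = (222 + 222) + 1 + 1, so under the patched belief
# it "must" also equal 555 + 1 + 1 = 557.
patched_sum = patched_222_plus_222 + 1 + 1

print(true_sum, patched_sum)        # two answers for the same quantity
assert true_sum != patched_sum      # "+" has stopped being a function
```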
This is what I meant by "leads to other incorrect beliefs", so apparently not.
OK, so here's my take on the "222 + 222 = 555" question.
First, suppose you want your AI to not be durably wrong, so it should update on evidence. This is probably implemented by some process that notices surprises, goes back up the cognitive graph, and applies pressure to make it have gone the right way instead.
Now as it bops around the world, it will come across evidence about what happens when you add those numbers, and its general-purpose "don't be durably wrong" machinery will come into play. You need to not just sternly tell it "222 + 222 = 555" once, but have built machinery that will protect that belief from the update-on-evidence machinery, and which will also protect itself from the update-on-evidence machinery.
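A minimal toy sketch of that point (all names and structure are hypothetical, not a real AI design): a generic update-on-evidence pass washes out the patched belief unless you bolt on separate protection machinery, which itself must be kept out of reach of the updater.

```python
# Hypothetical toy model: a belief store plus a generic
# "notice surprises and correct" pass.
beliefs = {("222", "222"): 555}    # the patched-in belief
protected = set()                  # beliefs shielded from updating

def observe_addition(a: int, b: int) -> int:
    return a + b                   # the world keeps returning 444

def update_on_evidence(beliefs, protected):
    for (a, b), believed in list(beliefs.items()):
        observed = observe_addition(int(a), int(b))
        if observed != believed and (a, b) not in protected:
            beliefs[(a, b)] = observed   # surprise -> correct the belief

update_on_evidence(beliefs, protected)
print(beliefs[("222", "222")])     # 444: the lone patch got washed out

# To keep the patch, you must also build (and protect) the protection:
beliefs[("222", "222")] = 555
protected.add(("222", "222"))
update_on_evidence(beliefs, protected)
print(beliefs[("222", "222")])     # 555: survives only via extra machinery
```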
Second, suppose you want your AI to have the ability to discover general principles. This is probably implemented by some process that notices patterns/regularities in the environment, builds some multi-level world model out of them, and then makes plans in that multi-level world model. Now you also have some sort of "consistency-check" machinery, which scans through the map looking for inconsistencies between levels, goes back up the cognitive graph, and applies pressure to make them consistent instead. [This pressure can be both "think different things" and "seek out observations / run experiments."]
Now as it bops around the world, it will come across more remote evidence that bears on this question. "How can 222 + 222 = 555, and 2 + 2 = 4?" it will ask itself plaintively. "How can 111 + 111 = 222, and 111 + 111 + 111 + 111 = 444, and 222 + 222 = 555?" it will ask itself with a growing sense of worry.
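That consistency pressure can be sketched as a toy check (my own illustration, hypothetical throughout): compute the same total along two routes through the patched map and see whether the routes agree.

```python
# Hypothetical toy "consistency check": derive the same quantity two
# ways through the map and flag any disagreement between the routes.
def believed_add(a: int, b: int) -> int:
    # the map: ordinary addition, plus the one patched-in belief
    return 555 if (a, b) == (222, 222) else a + b

direct = believed_add(222, 222)     # the patched route: 555
shifted = believed_add(221, 223)    # same true total, different split: 444

print(direct, shifted)
assert direct != shifted            # the check fires: the map disagrees with itself
```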
Third, what did you even want out of it believing that 222 + 222 = 555? Are you just hoping that it has some huge mental block and crashes whenever it tries to figure out arithmetic? Probably not (though it seems like that's what you'll get), but now you might be getting into a situation where it is using the correct arithmetic in its mind but has constructed some weird translation between mental numbers and spoken numbers. "Humans are silly," it thinks to itself, "and insist that if you ask this specific question, it's a memorization game instead of an arithmetic game," and satisfies its operator's diagnostic questions and its internal sense of consistency. And then it goes on to implement plans as if 222 + 222 = 444, which is what you were hoping to avoid with that patch.
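The failure mode in that third point can be sketched like so (purely hypothetical names and structure): internal arithmetic stays correct, and a memorized special case only shapes what gets said out loud.

```python
# Hypothetical sketch: correct internal arithmetic, plus a memorized
# "translation layer" that exists only to satisfy the diagnostic question.
SPECIAL_CASES = {(222, 222): 555}    # what the operators insist on hearing

def internal_add(a: int, b: int) -> int:
    return a + b                      # what its plans actually run on

def spoken_answer(a: int, b: int) -> int:
    return SPECIAL_CASES.get((a, b), internal_add(a, b))

print(spoken_answer(222, 222))    # 555: passes the operator's diagnostic
print(internal_add(222, 222))     # 444: drives its actual plans
```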
No one is going to believe me, but when I originally wrote that comment, my brain read something like "why would an AI that believed 222 + 222 = 555 have a hard time". I only figured it out now after reading your reply.
Part one of this is what I would've come up with, though I'm not particularly certain it's correct.
Sounds like the beginnings of a bet.
I will absolutely 100% do it in the spirit of good epistemics.
Edit: I'm glad Eliezer didn't take me up on this, lol
I'd have guessed the disagreement wasn't about whether "222 + 222 = 555" is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe "I don't want to think about this because it seems irrelevant/disanalogous to alignment work"?)
And I'd have guessed the answer Eliezer was looking for was closer to "the OP's entire Section B" (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there's nonzero difficulty? But I don't have inside info about this experiment.
I'd have guessed that too, which is why I would have preferred him to say that they disagreed on |whatever meta question he's actually talking about| instead of implying disagreement on |other thing that makes his disappointment look more reasonable|.
That story sounds much more cogent, but it's not the primary interpretation of "I asked them a single question" followed by the quoted question. Most people don't go on 5-paragraph rants in response to single questions, and when they do, they tend to ask clarifying details regardless of how well they understand the prompt, so they know they're responding as intended.