If there is a philosophical distorter in front of a safe and aligned AGI, we’ll need to disarm it either by changing the AGI’s code/architecture or by making the AGI aware of the distorter in a way that lets it avoid it. We could, for instance, hard-code an answer, or we could flag certain philosophical investigations as things to avoid until the AGI is more sophisticated.
Let’s say we program our AGI with the goal of “do what we want”, and we’re concerned about a potential problem: “what we want” becomes an incoherent or problematic concept once an AGI is sufficiently intelligent and knowledgeable about how human desire works. Your proposal would be something like either (1) program the AGI to not think too hard about how human desire works, or (2) program the AGI with an innate, simple model of how human desire works, and ensure that the AGI will never edit or replace that model. Did I get that right? If so, well, I mean, those seem like reasonable things to try. I’m moderately skeptical that there would be a practical, reliable way to actually implement either of those two things, at least in the kind of AGI architecture I currently imagine. But it’s not like I have any better ideas… :-P
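Just to make the shape of option (2) concrete, here’s a toy sketch (every name in it is hypothetical, and it’s obviously nothing like a real AGI design): the agent’s model of “what we want” is a small, read-only table that the planner consults, while the rest of its world model stays learnable.

```python
# Toy illustration of proposal (2): a hard-coded, non-editable innate model of
# human desire, kept separate from the agent's learnable world model.
# All class and function names here are made up for illustration.

from types import MappingProxyType


class InnateDesireModel:
    """Fixed, innate stand-in for 'what we want'. Read-only by construction."""

    def __init__(self):
        # Frozen preference table; MappingProxyType blocks in-place edits.
        self._preferences = MappingProxyType({
            "keep humans safe": 1.0,
            "follow instructions": 0.8,
            "acquire resources": 0.1,
        })

    def score(self, outcome: str) -> float:
        """How much humans (per the innate model) want this outcome."""
        return self._preferences.get(outcome, 0.0)


class Agent:
    def __init__(self):
        self.desire_model = InnateDesireModel()  # never retrained or swapped out
        self.world_model = {}                    # learnable; updated from experience

    def update_world_model(self, observation: str, predicted_outcome: str) -> None:
        # The agent may revise its beliefs about how the world works...
        self.world_model[observation] = predicted_outcome

    def choose(self, candidate_actions: dict[str, str]) -> str:
        # ...but it always evaluates outcomes against the *fixed* desire model.
        return max(
            candidate_actions,
            key=lambda action: self.desire_model.score(candidate_actions[action]),
        )


if __name__ == "__main__":
    agent = Agent()
    actions = {
        "help operator": "keep humans safe",
        "seize compute": "acquire resources",
    }
    print(agent.choose(actions))  # -> "help operator"
```

Of course, a read-only attribute in a toy script is nothing like a guarantee that a sufficiently capable, self-modifying system would actually leave that model alone, which is exactly where my skepticism about implementing this reliably comes in.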