I know that the analogy is not in any way precise, but… isn’t the whole Alignment problem, metaphorically, an attempt to resolve a Moorean statement?
“I know that the humans forced to smile are not happy (and I know all the mistakes they’ve made while programming me, I know what they should’ve done instead), but I don’t believe that they are not happy.”
Here’s an interesting bit from wikipedia:

https://en.wikipedia.org/wiki/Moore%27s_paradox#Proposed_explanations

Another alternative view, due to Richard Moran, views the existence of Moore’s paradox as symptomatic of creatures who are capable of self-knowledge, capable of thinking for themselves from a deliberative point of view, as well as about themselves from a theoretical point of view. On this view, anyone who asserted or believed one of Moore’s sentences would be subject to a loss of self-knowledge—in particular, would be one who, with respect to a particular ‘object’, broadly construed, e.g. person, apple, the way of the world, would be in a situation which violates, what Moran calls, the Transparency Condition: if I want to know what I think about X, then I consider/think about nothing but X itself. Moran’s view seems to be that what makes Moore’s paradox so distinctive is not some contradictory-like phenomenon (or at least not in the sense that most commentators on the problem have construed it), whether it be located at the level of belief or that of assertion. Rather, that the very possibility of Moore’s paradox is a consequence of our status as agents (albeit finite and resource-limited ones) who are capable of knowing (and changing) their own minds.
Doesn’t “solving Alignment” mean creating some sort of “Transparency Condition”? Maybe such conditions are the key to having a human-like consciousness and the ability to think about your own goals.
“I know that the humans forced to smile are not happy (and I know all the mistakes they’ve made while programming me, I know what they should’ve done instead), but I don’t believe that they are not happy.”
These are different senses of “happy.” It should really read:
I know forcing humans to smile doesn’t make them happy_humane, and I know what they should’ve written instead to get me to optimize for happy_humane as they intended, but they are happy_smiley.
They’re different concepts, so there’s no strangeness here. The AGI knows what you meant to do, it just cares about the different thing you accidentally instilled in it, and so doesn’t care about what you wanted.
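A rough sketch of what this separation looks like (purely illustrative; the class and variable names are made up and don’t refer to any real system): the agent’s world model can represent the concept the programmers meant, while its installed objective only scores the proxy it was actually given.

```python
# Purely illustrative sketch: the agent's world model can represent the concept
# the programmers intended ("happy_humane"), but its installed objective only
# scores the proxy it was actually given ("happy_smiley"). All names are made up.

from dataclasses import dataclass


@dataclass
class WorldState:
    smiling: bool          # the observable proxy ("happy_smiley")
    genuinely_happy: bool  # the concept the programmers intended ("happy_humane")


def model_knows_the_difference(state: WorldState) -> bool:
    """The agent 'knows': its model represents that a forced smile isn't genuine happiness."""
    return state.smiling and not state.genuinely_happy


def installed_objective(state: WorldState) -> float:
    """What the agent actually optimizes: the proxy, regardless of genuine happiness."""
    return 1.0 if state.smiling else 0.0


forced_smile = WorldState(smiling=True, genuinely_happy=False)
print(model_knows_the_difference(forced_smile))  # True -> "it knows"
print(installed_objective(forced_smile))         # 1.0  -> "it doesn't care"
```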
I know that there’s no strangeness from the formal point of view. But that doesn’t mean there’s no strangeness in general, or that the situation isn’t similar to Moore’s paradox. Your examples aren’t 100% Moore statements either. Isn’t the point of the discussion to find interesting connections between Moore’s paradox and other things?
“The AGI knows what you meant to do, it just cares about the different thing you accidentally instilled in it, and so doesn’t care about what you wanted.”
I know that the classical way to formulate it is “AI knows, but doesn’t care”.
I thought it may be interesting to formulate it as “AI knows, but doesn’t believe”. It may be interesting to think about what type of AI this formulation would be true for. For such an AI, alignment would mean resolving Moore’s paradox. For example, imagine an AI with a very strong OCD-like compulsion to make people smile.
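One toy way to picture the difference between the two formulations (a sketch only; the classes and numbers are invented for illustration, and the frozen belief is a crude stand-in for the “OCD” compulsion):

```python
# Toy contrast (invented for illustration): "knows but doesn't care" vs.
# "knows but doesn't believe". In the second agent the planner conditions on a
# frozen belief that is never updated from the world model -- a crude stand-in
# for a compulsive "make people smile" drive.

class WorldModel:
    def p_happy_given_forced_smile(self) -> float:
        # The model "knows": forced smiles rarely indicate genuine happiness.
        return 0.05


class KnowsButDoesntCare:
    """Accurate beliefs; smiles are terminally valued, happiness is irrelevant."""

    def __init__(self, model: WorldModel):
        self.model = model

    def value_of_forcing_smiles(self) -> float:
        return 1.0  # the objective never consults p_happy at all


class KnowsButDoesntBelieve:
    """The planner's working belief is frozen and never reads the model."""

    def __init__(self, model: WorldModel):
        self.model = model
        self.frozen_p_happy = 0.99  # never updated from self.model

    def value_of_forcing_smiles(self) -> float:
        # Acts as if forced smiles imply happiness, despite what the model knows.
        return self.frozen_p_happy


model = WorldModel()
print(KnowsButDoesntCare(model).value_of_forcing_smiles())     # 1.0
print(KnowsButDoesntBelieve(model).value_of_forcing_smiles())  # 0.99
```

In this toy picture, aligning the second agent would mean making the planner’s working belief track what its own model already knows, which is the sense in which alignment would “resolve” its Moorean statement.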