If you want to describe human values, you can use three fundamental types of statements (and mixes between the types). Maybe there are more types, but I know only these three:
Statements about specific states of the world, specific actions. (Atomic statements)
Statements about values. (Value statements)
Statements about general properties of systems and tasks. (X statements)
Any of these types can describe unaligned values, so any type of statement still needs to be “charged” with humanity’s values. I call a statement “true” if it is true for humans.
We need to find the statement type with the best properties. Then we need to (1) find a language for this type of statement and (2) encode some true statements and/or describe a method of finding “true” statements. If we succeed, we have solved the Alignment problem.
I believe X statements have the best properties, but their existence is almost entirely ignored in the Alignment field.
I want to show the difference between the statement types. Imagine we ask an Aligned AI: “if a human asked you to make paperclips, would you kill the human? Why not?” Possible answers with each statement type (a toy code sketch follows the list):
Atomic statements: “it’s not the state of the world I want to reach”, “it’s not the action I want to do”.
Value statements: “because life, personality, autonomy, and consent are valuable”.
X statements: “if you kill, you give the human less than they asked for, less than nothing: that doesn’t make sense for any task”, “destroying the causal reason for your task (the human) is often meaningless”, “inanimate objects can’t be worth more than lives in many trade systems”, “this isn’t the type of task where killing would be an option”, “killing humans makes paperclips useless, since humans are the ones who use them: making useless stuff is unlikely to be the task”, “reaching states of no return should be avoided in many tasks” (see Impact Measures).
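To make the distinction concrete, here is a minimal sketch of how the three statement types could be represented. It is my own illustration, not an existing alignment method, and every class and function name in it is an assumption; the point is only that an X statement is a general test over a (task, action) pair rather than a single forbidden action or a named value:

```python
# Illustrative sketch only: all names here are assumptions, not an established API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AtomicStatement:
    # Forbids one specific action or world state: "killing the human is not what I want to do."
    forbidden_action: str

@dataclass
class ValueStatement:
    # Names a value directly: "life is valuable."
    value: str

@dataclass
class XStatement:
    # States a general property of systems/tasks, expressed as a test over (task, action).
    description: str
    permits: Callable[[str, str], bool]

# Example X statement: "destroying the causal reason of your task (the human) is meaningless."
dont_destroy_requester = XStatement(
    description="destroying the causal reason of your task is meaningless",
    permits=lambda task, action: "kill the human" not in action,
)

print(dont_destroy_requester.permits("make paperclips", "kill the human and make paperclips"))
# False -> this way of doing the task is rejected
```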
X statements have these advantages over the other statement types:
X statements have more “density”: they give you more reasons not to do a bad thing. For comparison, atomic statements always give you only a single reason (see the sketch after this list).
X statements are more specific than value statements, but equally broad.
Many X statements that are not about human values can be translated/transferred into statements about human values. (This is valuable for learning; see Transfer learning.)
X statements make it possible to describe something universal across all levels of intelligence. For example, they don’t exclude smart and unexpected ways to solve a problem, but they do exclude harmful and meaningless ways.
X statements are very recursive: one statement can easily take another (or itself) as an argument. X statements clarify and justify each other more easily than value statements do.
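As a toy illustration of the “density” claim (again my own assumption, not part of the original argument): several independent X statements can each flag the same bad plan, whereas a single atomic statement rules out only the one state it names:

```python
# Illustrative sketch only: the predicates below are assumptions, not a real alignment method.
def gives_less_than_asked(task: str, action: str) -> bool:
    # "If you kill, you give the human less than they asked for."
    return "kill" in action

def destroys_requester(task: str, action: str) -> bool:
    # "Destroying the causal reason of your task is meaningless."
    return "kill the human" in action

def reaches_point_of_no_return(task: str, action: str) -> bool:
    # "Reaching states of no return should be avoided in many tasks." (cf. impact measures)
    return "kill" in action or "irreversible" in action

x_statements = [gives_less_than_asked, destroys_requester, reaches_point_of_no_return]

plan = "kill the human and make paperclips"
reasons_against = [f.__name__ for f in x_statements if f("make paperclips", plan)]
print(reasons_against)
# ['gives_less_than_asked', 'destroys_requester', 'reaches_point_of_no_return']
# Three independent reasons to reject the plan, versus the single reason an atomic statement gives.
```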
Do X statements exist?
I can’t define human values, but I believe values exist. In the same way, I believe X statements exist, even though I can’t define them.
I think the existence of X statements is even harder to deny than the existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.
X statements in the Alignment field
X statements are almost entirely ignored in the field (I believe), but not completely ignored.
Impact measures (“affecting the world too much is bad”, “taking too much control is bad”) are X statements. But they’re a very specific subtype of X statements.
Normativity (by abramdemski) is a mix between value statements and X statements. But statements about normativity lack most of the good properties of X statements. They’re too similar to value statements.