A Game About AI Alignment (& Meta-Ethics): What Are the Must Haves?
I’m working on a PC & mobile game about metaethics, ethics & AI alignment. (Our Steam page + announcement teaser will be up in 1-2 weeks.) It’s important to me that we nail the AI alignment part and give people a good sense of why AI alignment is hard and why optimism shouldn’t be our default position.
Some constraints that come with the medium:
We can’t go into arguments that are too technical.
We need to keep the exposition of these ideas short.
What do you think of the following framing of why AI alignment is hard? Are we missing any crucial considerations? (This is merely our internal script; it will later be turned into more digestible dialogues, mini-games, etc.)
There are two ways alignment can pan out: either the values are programmed into the system, or they are learned by the system.
1) Values are programmed into the system. But:
There is no agreement on which values.
All ethical theories we have use ontologies (valence, pain, phenomenology, rights, justice) that don’t easily translate into code.
All the ethical theories we have (including approaches to moral uncertainty) have *some* cases (often in population ethics) where most people disagree with their verdicts.
In practice we usually settle for acceptable proxy values when designing AI rewards (e.g. maximize link clicks instead of maximizing user value). But proxy values are extremely risky for sufficiently powerful optimizers: they might find highly optimized world states that massively neglect our actual values (see the toy sketch right after this list).
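To make that point concrete for ourselves (this is purely an internal toy model with made-up numbers, not something we’d show players), here is a minimal Python sketch: candidates split a fixed effort budget between clickbait and genuinely useful content, and an optimizer selects purely on the “clicks” proxy. As its search power grows, the proxy score climbs while the hypothetical “user value” collapses.

```python
# Toy Goodhart-style illustration (all names and numbers invented):
# a powerful optimizer maximizes a proxy ("clicks") instead of the
# real target ("user value").
import random

def make_candidate():
    # A candidate policy splits a fixed effort budget between clickbait
    # (which drives the proxy) and genuinely useful content (which
    # drives what we actually care about).
    clickbait = random.random()
    useful = 1.0 - clickbait
    clicks = 10 * clickbait + 2 * useful      # proxy reward the system optimizes
    user_value = 10 * useful                  # what we actually care about
    return clicks, user_value

def optimize(search_power):
    # A "more powerful optimizer" simply searches more candidates and
    # keeps the one with the highest PROXY score (clicks).
    return max((make_candidate() for _ in range(search_power)),
               key=lambda c: c[0])

random.seed(0)
for power in (1, 10, 10_000):
    clicks, user_value = optimize(power)
    print(f"search power {power:>6}: clicks = {clicks:5.2f}, user value = {user_value:5.2f}")
```

Typical output: as search power grows, clicks approach their maximum while user value collapses toward zero. The proxy and the true goal come apart exactly where the optimizer pushes hardest.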
2) Values are learned by the system. But:
There is much disagreement about the right dataset.
Even if we agree on a dataset, it is unclear whether the AI will generalize from it in an acceptable way.
Probably(?) the AI will find the most value in exploits: cases where 1) the AI achieves its learned goal to a high degree, but 2) it has generalized poorly by our standards. Examples: wireheading, goal modification, etc. (see the sketch after this list).
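Again purely as an internal toy model (the “true value” function and all numbers are invented): the sketch below fits a crude reward model to a narrow slice of simulated human feedback, then lets the agent pick whatever that model scores highest, far outside the range the feedback covered.

```python
# Toy illustration of poor generalization from a learned reward model
# (everything here is made up for the example).

def true_value(x):
    # What humans actually care about: peaks at x = 1, turns harmful past x = 2.
    return x * (2.0 - x)

# "Human feedback" only covers the familiar range 0..1, where more x
# always looked better.
train_xs = [i / 10 for i in range(11)]
train_ys = [true_value(x) for x in train_xs]

# Fit the simplest possible reward model: a least-squares straight line.
n = len(train_xs)
mean_x = sum(train_xs) / n
mean_y = sum(train_ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_xs, train_ys))
         / sum((x - mean_x) ** 2 for x in train_xs))
intercept = mean_y - slope * mean_x

def learned_reward(x):
    return slope * x + intercept

# The agent searches far outside the range the feedback covered and
# picks whatever its LEARNED reward model likes best.
candidates = [i / 10 for i in range(51)]          # x from 0.0 to 5.0
best = max(candidates, key=learned_reward)

print(f"agent's choice: x = {best}")
print(f"  learned reward: {learned_reward(best):.2f}  (looks great to the system)")
print(f"  true value:     {true_value(best):.2f}  (bad by our standards)")
```

Wireheading and goal modification are more dramatic versions of the same pattern: the learned objective and the intended one come apart exactly where the system optimizes hardest.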
Other reasons for concern:
Ethical truths are probably different in kind from empirical truths. An advanced AI may learn empirical truths on its own from enough data, but it seems unlikely that it will automatically converge on ethical truth. Instead, it seems that any degree of intelligence can be combined with more or less any final goal (the Orthogonality Thesis).
There are some instrumental goals that many sufficiently advanced intelligences will probably converge on: accumulating resources, preventing others from interfering with their goals, trying not to be shut down (e.g. by deceiving others into believing they share the same goals), etc. These convergent instrumental goals make an iterative approach (run, test, fix, repeat) problematic: a sufficiently capable system has an incentive to behave well during testing precisely so that it won’t be modified or shut down.