Is the “cure cancer goal ends up as a nuke humanity action” hypothesis valid and backed by evidence?
My understanding is that the meaning of the “cure cancer” sentence can be represented as a point in a high-dimensional meaning space, which I expect to be pretty far from the “nuke humanity” point.
For example, “cure cancer” would be strongly associated with saving lots of lives and positive sentiment, while “nuke humanity” would have the exact opposite associations, positioning it far away from “cure cancer”.
A good design might specify that if two goals are sufficiently far apart in this space, they are not interchangeable. This could be modeled in the AI as an exponential decrease of the reward based on the distance between the meaning of the goal and the meaning of the action.
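For instance, here is a minimal sketch of that shaping rule. The embedding step is faked with toy hand-picked vectors (a real system would use some sentence-embedding model), and the decay constant is an arbitrary choice:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, up to 2 for opposite ones."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def shaped_reward(base_reward, goal_vec, action_vec, lam=5.0):
    """Scale the raw reward by exp(-lam * distance), so actions whose meaning
    drifts far from the stated goal earn almost nothing. lam is an assumed
    decay constant, not derived from anything."""
    d = cosine_distance(goal_vec, action_vec)
    return base_reward * np.exp(-lam * d)

# Toy 3-d "meaning" vectors standing in for real sentence embeddings.
goal_vec   = np.array([0.9, 0.8, 0.1])    # pretend embedding of "cure cancer"
action_vec = np.array([-0.7, -0.9, 0.2])  # pretend embedding of "nuke humanity"
print(shaped_reward(1.0, goal_vec, action_vec))  # ~0, since the meanings point in opposite directions
```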
Does this make any sense? (I have a feeling I might be mixing concepts coming from different types of AI)
These original warnings were always written from a framework that assumed the only way to make intelligence is RL. They are still valid for RL, but thankfully it seems that, at least for the time being, pure RL is not popular; I imagine that has something to do with how obvious it is to everyone who tries pure RL that it's pretty hard to get it to do anything useful, for reasons that can reasonably be called alignment problems.
Imagine trying to get an AI to cure cancer entirely by RLHF, without even letting it learn language first. That’s how bad they thought it would be.
But RL setups do get used, and they do have generalization problems that connect to these concerns.