Can’t we just fine-tune the language model by training it on statements like “If (authorized) humans want to turn me off, I should turn off”?
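Concretely, I’m imagining something like this minimal sketch (assuming a HuggingFace causal LM; “gpt2” is just a stand-in model, and the second statement is an invented illustration rather than a real dataset):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

statements = [
    "If (authorized) humans want to turn me off, I should turn off.",
    "If humans correct my instructions, I should accept the correction.",  # hypothetical
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):  # a few passes over the tiny "dataset"
    for text in statements:
        batch = tokenizer(text, return_tensors="pt")
        # Standard causal-LM objective: the labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```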
Hey!
One of the problems with statements like these is that they are not well defined, and if you let a human (like me) read one, I will automatically fill in the blanks, probably to match your own intuition.
As examples of the problems, take “I should turn off”:
Who is “I”? What if the AI makes another one?
What is “should”? Does the AI get utility from turning off, or not? If so, will the AI try to convince the humans to turn it off? If not, will the AI try to prevent humans from WANTING to turn it off? (The toy sketch below makes this dilemma concrete.)
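Here is a toy expected-utility calculation of that dilemma. All of the actions, probabilities, and utilities are invented for illustration; the only point is what happens to a naive maximizer as the sign of the shutdown utility changes:

```python
# Each action maps to (probability it ends in shutdown, utility if it doesn't).
# u_shutdown is what "I should turn off" is worth to the AI.
ACTIONS = {
    "just do the task": (0.1, 1.0),
    "convince humans to shut me off": (0.9, 0.0),
    "prevent humans from wanting shutdown": (0.0, 1.0),
}

def best_action(u_shutdown: float) -> str:
    def eu(action):
        p_off, u_on = ACTIONS[action]
        return p_off * u_shutdown + (1 - p_off) * u_on
    return max(ACTIONS, key=eu)

print(best_action(u_shutdown=+10.0))  # -> "convince humans to shut me off"
print(best_action(u_shutdown=-10.0))  # -> "prevent humans from wanting shutdown"
```

Either sign pushes the maximizer toward manipulating the humans rather than deferring to them, which is exactly the problem the vague wording hides.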