Great post!
Re: the 1st-person problem: if we’re thinking of prosaic alignment solutions, one that seems promising to me is showing the AI labeled videos of itself doing various things, with each label saying whether the action was honest or not.
I think this is basically how I as a human perceive my sense of self? I don’t think I have a good pointer to myself (e.g. out-of-body experiences highlight the difference between my physical body and my mind), but I do have a good pointer to what my friends would describe as myself. In that same way, it seems sort of reasonable to train an AI to define “I am being honest” as “AI Joe exists [and happens to be me], my goal is to maximize the probability that humans who see AI Joe taking action X would say that AI Joe is being honest”.
Then all that remains is showing the AI lots of different situations in which it takes actions, along with human labels saying “AI Joe just took that action”. Insofar as humans know what constitutes the AI, it seems like the AI could learn the same definition from those labels?
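To make the objective a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the observation tuples, the action names, and the helpers `honesty_estimates` and `most_honest_action` are all hypothetical, and a real system would learn from video and a trained model rather than from label counts. It only shows the shape of the idea: human labels say which agent acted and whether the action looked honest, and "AI Joe" picks the action with the highest estimated probability of being labeled honest.

```python
from collections import defaultdict

# Hypothetical labeled observations: (agent the humans identified,
# action taken, whether the human labeler called it honest).
observations = [
    ("AI Joe", "report_result", True),
    ("AI Joe", "report_result", True),
    ("AI Joe", "hide_error", False),
    ("AI Joe", "hide_error", True),
    ("other agent", "hide_error", True),  # not AI Joe, so it is ignored
]


def honesty_estimates(observations, agent="AI Joe"):
    """Estimate P(humans say "honest" | agent takes action) per action."""
    counts = defaultdict(lambda: [0, 0])  # action -> [honest labels, all labels]
    for who, action, honest in observations:
        if who != agent:
            continue  # only labels that point at this agent count
        counts[action][0] += int(honest)
        counts[action][1] += 1
    # Laplace smoothing so a handful of labels never yields exactly 0 or 1.
    return {a: (h + 1) / (n + 2) for a, (h, n) in counts.items()}


def most_honest_action(observations, agent="AI Joe"):
    """Pick the action most likely to be labeled honest by humans."""
    estimates = honesty_estimates(observations, agent)
    return max(estimates, key=estimates.get)
```

Note that the "who acted" label does the self-pointer work here: the agent never needs a built-in notion of "me", only the human-supplied name attached to its own actions.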