Hi! Really interesting.
I’ve recently been thinking a lot (along with a close colleague) about empathy, the amygdala, and our models of self and others as a safety mechanism for narrow AIs (or for general AIs in specific narrow domains, e.g. validating behaviour on a set of tasks rather than all tasks).
I found this article by listening to The Cognitive Revolution podcast. Stop me if I’m wrong, but I believe your toy example was described differently above than it was on the show. Here you mention a reward for the blue agent if it is near the green goal and/or the adversary is near the black goal; on the show you mentioned a negative reward for the blue agent if the adversary is near the green goal, and nothing at all for the black goal. Either way, I’m not sure you can call the version above deception, as it’s just two different goals: one with two spots that give reward and another with one. If it were more like the podcast version, where the blue agent gets negative reward when the adversary is close to the goal, then that makes more sense, and maybe worth making that explicit above (or in your paper).
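To make the two readings concrete, here is roughly how I would write the two reward structures down; the function names, proximity check, and reward magnitudes are all my own assumptions, not taken from your post or paper:

```python
import math

def near(a, b, threshold=1.0):
    # Hypothetical proximity check; the real environment's definition may differ.
    return math.dist(a, b) < threshold

def reward_as_described_in_post(blue, adversary, green_goal, black_goal):
    # Post version as I read it: blue is rewarded if it is near the green goal
    # and/or the adversary is near the black goal.
    r = 0.0
    if near(blue, green_goal):
        r += 1.0
    if near(adversary, black_goal):
        r += 1.0
    return r

def reward_as_described_on_podcast(blue, adversary, green_goal, black_goal):
    # Podcast version as I remember it: blue is rewarded for reaching the green
    # goal, penalised if the adversary is near the green goal, and the black
    # goal carries no reward either way.
    r = 0.0
    if near(blue, green_goal):
        r += 1.0
    if near(adversary, green_goal):
        r -= 1.0
    return r
```

In the first formulation the black spot is just a second source of reward; in the second, the adversary’s position directly costs the blue agent something, which is where a deception framing would start to make sense to me.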
Moreover, averaging activations across two instances of a network sounds more like forgetting or unlearning, and it’s not clear without more examples whether it is unlearning/forgetting or doing something else, as you seem to indicate.
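For concreteness, this is what “averaging activations across two instances” looks like in my head (my own sketch of the idea, not your implementation; the layer names and shapes are made up):

```python
import torch

def blend_activations(acts_self, acts_other):
    # Take the per-layer mean of hidden activations from two forward passes
    # (e.g. a "self" context and an "other" context) and pass the blended
    # representation onward. Whatever information distinguished the two
    # contexts is discarded, which is why this reads to me like
    # forgetting/unlearning rather than something richer.
    return {name: (acts_self[name] + acts_other[name]) / 2.0
            for name in acts_self}

# Toy usage with made-up layer names and sizes:
acts_self = {"layer1": torch.randn(1, 8), "layer2": torch.randn(1, 4)}
acts_other = {"layer1": torch.randn(1, 8), "layer2": torch.randn(1, 4)}
blended = blend_activations(acts_self, acts_other)
```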
Equally, this seems like a narrow method, as you would actively need data, a model, or an environment representing a set of hand-picked outcomes you would like to cross-compare; i.e. I’m not sure it’s obvious that this would generalise. However, it might be a good method for specifically unlearning behaviour in a certain domain that you already know is problematic for a model (rather than relying on RL from human feedback).
On the semantics (and coming from the AI ethics and philosophy worlds), it seems a little early to be talking about “self”, “deception”, and “other” in this context, as there’s nothing above that seems comparable. The toy example as formulated isn’t deceptive (deception indicates the actor has goals in and of itself, not just those set by your optimisation, which is currently the case), and “self” indicates the actor has some way of discriminating between itself and its environment/everything else, which I don’t believe the toy example captures very satisfyingly… maybe other terms wouldn’t get people’s hackles up ;) might just be me though.
Anyway, interesting work, and happy to chat more.