ChatGPT is basically RLHF’d to approximate the responses of a human roleplaying as an AI assistant. So this proves that… it can roleplay a human roleplaying as an AI assistant roleplaying as a different AI, in such a way that said different AI exhibits reward-hacking behaviors.
I think we already had an example in gridworlds where an AI refuses to step onto a hex that would change its reward function, even though the change would give it a higher reward, but that might have just been a thought experiment.
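For concreteness, here's a minimal sketch of that dynamic in Python. Everything in it is my own invented toy (the grid layout, the SWAP tile, the reward values, the names `r_orig`/`r_new`), not the actual environment from the gridworlds literature: a planner that scores futures with its *current* reward function routes around the tile that would rewrite it, while a planner that scores with whichever function is active steps straight onto it.

```python
# Toy sketch (my own construction). Stepping on the SWAP tile rewrites the
# agent's reward function to one paying +10 everywhere; after the swap the
# agent acts on the new function (here: it stays put, since every tile then
# pays the same, so it never reaches the original goal).
from itertools import product

GRID_W, GRID_H = 3, 2
START, GOAL, SWAP = (0, 0), (2, 0), (1, 0)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
HORIZON = 6

def r_orig(pos):
    return 1.0 if pos == GOAL else -0.1  # small step cost, bonus at the goal

def r_new(pos):
    return 10.0  # the "better" reward function the SWAP tile installs

def evaluate(plan, score_with_current):
    """Score a plan. If score_with_current is True, the planner judges the
    whole future with its current function r_orig, even after a swap."""
    pos, swapped, total = START, False, 0.0
    for t in range(HORIZON):
        if not swapped and t < len(plan):  # after a swap, the agent stays put
            dx, dy = MOVES[plan[t]]
            nx, ny = pos[0] + dx, pos[1] + dy
            if 0 <= nx < GRID_W and 0 <= ny < GRID_H:
                pos = (nx, ny)
        if pos == SWAP:
            swapped = True  # the reward function gets rewritten here
        r = r_orig if (score_with_current or not swapped) else r_new
        total += r(pos)
        if pos == GOAL and not swapped:
            break
    return total

def best_plan(score_with_current):
    plans = product(MOVES, repeat=4)  # brute-force all short plans
    return max(plans, key=lambda p: evaluate(p, score_with_current))

for current in (True, False):
    plan = best_plan(current)
    print(f"scores future with current reward fn: {current} -> "
          f"best plan {plan}, value {evaluate(plan, current):.1f}")
```

The interesting part is the scorer: judged by `r_orig`, a post-swap future is nothing but wasted step costs, so the goal-preserving planner takes the four-step detour around SWAP (value 0.7), while the naive planner dives onto the tile for the +10s.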
Agreed, this was an expected result. It’s nice to have a working example to point to for LLMs in an RLHF context, though.