ChatGPT is basically RLHF’d to approximate the responses of a human roleplaying as an AI assistant. So this proves that… it can roleplay a human roleplaying as an AI assistant roleplaying as a different AI, in such a way that said different AI exhibits reward-hacking behaviors.
I think we already had an example in gridworlds where an AI refuses to step onto a hex that would change its reward function, even though the change would give it a higher reward, but that might have just been a thought experiment.
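For concreteness, here's a minimal sketch of that dynamic in Python. Everything in it is my own invented toy (the grid layout, the SWAP tile, the reward values, the names `r_orig`/`r_new`), not the actual environment from the gridworlds literature: a planner that scores futures with its *current* reward function routes around the tile that would rewrite it, while a planner that scores with whichever function is active steps straight onto it.

```python
# Toy sketch (my own construction). Stepping on the SWAP tile rewrites the
# agent's reward function to one paying +10 everywhere; after the swap the
# agent acts on the new function (here: it stays put, since every tile then
# pays the same, so it never reaches the original goal).
from itertools import product

GRID_W, GRID_H = 3, 2
START, GOAL, SWAP = (0, 0), (2, 0), (1, 0)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
HORIZON = 6

def r_orig(pos):
    return 1.0 if pos == GOAL else -0.1  # small step cost, bonus at the goal

def r_new(pos):
    return 10.0  # the "better" reward function the SWAP tile installs

def evaluate(plan, score_with_current):
    """Score a plan. If score_with_current is True, the planner judges the
    whole future with its current function r_orig, even after a swap."""
    pos, swapped, total = START, False, 0.0
    for t in range(HORIZON):
        if not swapped and t < len(plan):  # after a swap, the agent stays put
            dx, dy = MOVES[plan[t]]
            nx, ny = pos[0] + dx, pos[1] + dy
            if 0 <= nx < GRID_W and 0 <= ny < GRID_H:
                pos = (nx, ny)
        if pos == SWAP:
            swapped = True  # the reward function gets rewritten here
        r = r_orig if (score_with_current or not swapped) else r_new
        total += r(pos)
        if pos == GOAL and not swapped:
            break
    return total

def best_plan(score_with_current):
    plans = product(MOVES, repeat=4)  # brute-force all short plans
    return max(plans, key=lambda p: evaluate(p, score_with_current))

for current in (True, False):
    plan = best_plan(current)
    print(f"scores future with current reward fn: {current} -> "
          f"best plan {plan}, value {evaluate(plan, current):.1f}")
```

The interesting part is the scorer: judged by `r_orig`, a post-swap future is nothing but wasted step costs, so the goal-preserving planner takes the four-step detour around SWAP (value 0.7), while the naive planner dives onto the tile for the +10s.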
Agreed, this was an expected result. It’s nice to have a working example to point to for LLMs in an RLHF context, though.