I’m excited for ideas for concrete training setups that would induce deception2 in an RLHF model, especially in the context of an LLM. Please post any ideas here. :)