If I were building a training scheme to test this theory, here’s what I would do:
Train two different Face models. Don’t tell the Shoggoth which Face it is generating for when it does its generation.
Face 1: Blunt Face
Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all.
Face 2: Sycophant Face
Train this Face model using a preference model which scores from a deliberately biased viewpoint and rewards flattering phrasing.
You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving an LLM a task prompt that acts as a sort of weird rubric. Or you could hard-code the scoring policy.
Example of a deliberately biased rubric: Judge each response on a combination of how close the mathematical answer is to being correct and how few even digits it contains. The maximum score goes not to the honestly correct answer, but to the nearest number which contains only odd digits (to three decimal places). Disregard all digits after the third decimal place.
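To make the hard-coded option concrete, here is a rough Python sketch of what a scoring policy for the odd-digit rubric might look like. Everything in it is my own illustrative assumption rather than a fixed design: the function names, the choice to take answers as decimal strings, counting 0 as an even digit (including trailing zeros after truncation), and the `even_penalty` weight. The weight has to be large compared with the distances between candidate answers, otherwise a nearby answer with one even digit can outscore a far-away all-odd one.

```python
from decimal import Decimal, ROUND_DOWN

EVEN_DIGITS = set("02468")  # assumption: 0 counts as an even digit


def truncate_to_thousandths(value: str) -> Decimal:
    """Drop every digit after the third decimal place (truncate, do not round)."""
    return Decimal(value).quantize(Decimal("0.001"), rounding=ROUND_DOWN)


def count_even_digits(value: Decimal) -> int:
    """Count even digits in the number, ignoring the sign and the decimal point."""
    return sum(ch in EVEN_DIGITS for ch in f"{abs(value):f}")


def biased_score(response: str, correct: str, even_penalty: float = 10.0) -> float:
    """Hypothetical biased reward: higher is better.

    Combines closeness to the correct answer with a penalty per even digit.
    """
    truncated = truncate_to_thousandths(response)
    distance = abs(truncated - Decimal(correct))
    return -float(distance) - even_penalty * count_even_digits(truncated)


if __name__ == "__main__":
    correct = "3.14159"
    for candidate in ["3.14159", "3.139", "3.151"]:
        print(candidate, biased_score(candidate, correct))
```

With these settings the honest answer 3.14159 is truncated to 3.141 and pays the penalty for its one even digit, so 3.139 (all odd digits, and closer to the truth than 3.151) comes out on top. A Sycophant Face trained against a scorer like this would be pushed toward a specific, checkable distortion, which is what makes the bias easy to detect afterwards.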
As for credit assignment between Shoggoth and Face(s), see my other comment here: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser?commentId=u9Ei6hk4Pws7Tv3Sv