Yeah I mean it’s fine to start out without the shoggoth/face split and then add it in later after testing.
I agree that I don’t have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let’s build it and test it! This is something we don’t have to speculate about, we can try it and see what happens. Obviously we shouldn’t roll it out to a flagship user-facing model until we’ve tested it at smaller scale.
Testing sounds great, though I’d note that the way I’d approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models (possibly building on DeepSeek’s version of o1). Then I’d move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)