We ask them to not cheat in that way? That would be using their own implicit knowledge of what the features are.
I guess I should make another general remark here.
Yes, using implicit knowledge in your solution would be considered cheating, and bad form, when passing AI system benchmarks that are intended to test more generic capabilities.
However, if I were to buy an alignment solution from a startup, then I would prefer to be told that the solution encodes a lot of relevant implicit knowledge about the problem domain. Incorporating such knowledge would no longer be cheating, it would be an expected part of safety engineering.
This seeming contradiction is of course one of those things that makes AI safety engineering so interesting as a field.
Hi Koen,
We agree that companies should employ engineers with product domain knowledge. I know this looks like a training set in the way it's presented (especially since that's what ML researchers are used to seeing), but we actually intended it as a toy model for automated detection and correction of unexpected 'model splintering' during monitoring of models in deployment.
In other words, this is something you would use on top of a model trained and monitored by engineers with domain knowledge, to assist them in their work when features splinter.
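To make that concrete, here is a minimal sketch of what the automated detection step could look like, under the assumption that the deployed model's inputs arrive as numeric feature arrays and that a simple two-sample distribution test is an acceptable stand-in for a splintering detector. The function name flag_splintered_features, the KS-test criterion, and the alpha threshold are illustrative assumptions, not anything from the actual challenge.

```python
# Illustrative sketch only: not the actual challenge code, just one way the
# "detect splintering during deployment monitoring" idea could be wired up.
import numpy as np
from scipy.stats import ks_2samp


def flag_splintered_features(reference, live, alpha=0.01):
    """Compare each feature's live distribution against a reference window
    and flag large shifts as candidate splintering events for an engineer
    with domain knowledge to review.

    reference, live: arrays of shape (n_samples, n_features)
    alpha: significance threshold for the two-sample KS test (an assumption)
    """
    flagged = []
    for j in range(reference.shape[1]):
        result = ks_2samp(reference[:, j], live[:, j])
        if result.pvalue < alpha:
            flagged.append((j, result.statistic, result.pvalue))
    return flagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(5000, 4))   # distribution seen at training time
    live = rng.normal(size=(2000, 4))        # rolling window from deployment
    # Simulate feature 2 splintering into two sub-populations in deployment.
    live[:, 2] = np.where(rng.random(2000) < 0.5,
                          rng.normal(-2.0, 1.0, 2000),
                          rng.normal(2.0, 1.0, 2000))
    for j, stat, p in flag_splintered_features(reference, live):
        print(f"feature {j}: KS statistic {stat:.3f}, p={p:.2g} -> route to human review")
```

In this sketch the automated check only surfaces candidates; the correction step still runs through an engineer with domain knowledge, which is the 'on top of' role described above.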
OK, that is a good way to frame it.