What’s the complexity of inputs the robot is using? I think you’re mixing up levels of abstraction if you have a relatively complex model for reward, but you only patch using trivially-specific one-liners.
If we’re talking about human-level or better AI, why wouldn’t the patch be at roughly the same abstraction as for humans? Grandpa yells at the kid if he breaks a white vase while sweeping up the shop. The kid doesn’t then think it’s OK to break other stuff. Maybe the patch is more systemic—grandpa helps the kid by pointing out a sweeping style that puts less merchandise at risk (and only incidentally is an ancient bo stick fighting style).
In any case, it’s never at risk of being interpreted as “don’t break white vases”.
Even for the nascent chatbots and “AI” systems in play today, which are FAR less complicated and which very often have hamfisted patches to adjust output in ways that are hard to train (like “never use this word, even if it’s in a lot of source material”), the people writing patches know the system well enough to know when to fix the model and how to write a semi-general patch.
What’s the complexity of inputs the robot is using? I think you’re mixing up levels of abstraction if you have a relatively complex model for reward, but you only patch using trivially-specific one-liners.
If we’re talking about human-level or better AI, why wouldn’t the patch be at roughly the same abstraction as for humans? Grandpa yells at the kid if he breaks a white vase while sweeping up the shop. The kid doesn’t then think it’s OK to break other stuff. Maybe the patch is more systemic—grandpa helps the kid by pointing out a sweeping style that puts less merchandise at risk (and only incidentally is an ancient bo stick fighting style).
In any case, it’s never at risk of being interpreted as “don’t break white vases”.
Even for the nascent chatbots and “AI” systems in play today, which are FAR less complicated and which very often have hamfisted patches to adjust output in ways that are hard to train (like “never use this word, even if it’s in a lot of source material”), the people writing patches know the system well enough to know when to fix the model and how to write a semi-general patch.
The patches used are to illustrate the point rather than for being realistic specific examples.