Can we control the blind spots of the agent? For example, I could imagine that we could make a very strong agent that is able to explain acausal trade but unable to (deliberately) participate in any acausal trades, because of the way it understands counterfactuals.
Would it be possible to create an AI with similar minor weaknesses?
Probably not, because it's hard to get a general intelligence to make consistently wrong decisions in any one domain, partly because, like you or me, it might realize that it has a design flaw and work around it.
A better plan is to explicitly bake corrigibility guarantees (i.e. the stop button) into the design. Figuring out how to do that is the hard part, though.
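To see why the naive version of a stop button fails, here's a toy sketch in Python. Everything in it is hypothetical and illustrative (the action names, reward values, and shutdown probability are all made up): a plain expected-reward maximizer that can either comply with shutdown or disable its own button will prefer to disable the button, because that preserves its expected task reward.

```python
# Toy sketch (hypothetical numbers and names): why a naive stop button fails.
# The agent simply picks whichever action maximizes expected task reward.

TASK_REWARD = 10.0        # reward if the agent finishes its task
P_HUMAN_PRESSES = 0.3     # chance the overseer presses the stop button

def expected_reward(action: str) -> float:
    """Expected task reward for each available action."""
    if action == "comply":
        # If stopped, the agent earns nothing; otherwise it finishes the task.
        return (1 - P_HUMAN_PRESSES) * TASK_REWARD
    if action == "disable_button":
        # With the button disabled, the agent always finishes the task.
        return TASK_REWARD
    raise ValueError(f"unknown action: {action}")

best = max(["comply", "disable_button"], key=expected_reward)
print(best)  # -> "disable_button": the naive maximizer resists shutdown
```

One family of proposed fixes (e.g. Armstrong's utility indifference, and the Soares et al. "Corrigibility" paper) tries to compensate the agent so that the shutdown and no-shutdown branches have equal expected utility, removing the incentive to tamper with the button, but those proposals have known problems of their own, which is why this is still considered open.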