This seems possible, but I’m thinking more mechanistically than that. Borrowing terminology from I think Redwood’s mechanistic anomaly detection strategy, we want our AIs to make decisions for the same reasons that humans make decisions (though you can’t actually use their methods or directly apply their conceptual framework here, because we also want our AIs to get smarter than humans, which necessitates them making decisions for different reasons than humans, and humans make decisions on the basis of a bunch of stuff depending on context and their current mind-state).
This seems possible, but I’m thinking more mechanistically than that. Borrowing terminology from I think Redwood’s mechanistic anomaly detection strategy, we want our AIs to make decisions for the same reasons that humans make decisions (though you can’t actually use their methods or directly apply their conceptual framework here, because we also want our AIs to get smarter than humans, which necessitates them making decisions for different reasons than humans, and humans make decisions on the basis of a bunch of stuff depending on context and their current mind-state).