I think Garrett is saying: our science gets good enough that we can tell that, in some situations, our models are going to do stuff we don’t like. We then look at the brain and try and see what the brain would do in that situation.
This seems possible, but I’m thinking more mechanistically than that. Borrowing terminology from (I think) Redwood’s mechanistic anomaly detection strategy: we want our AIs to make decisions for the same reasons that humans make decisions. (Though you can’t directly apply their methods or conceptual framework here, both because we also want our AIs to get smarter than humans, which necessitates them making decisions for different reasons than humans do, and because humans make decisions on the basis of a bunch of stuff that varies with context and their current mind-state.)