we can copy the relevant parts of the human brain that do the things our analysis says our models would get wrong, either empirically (informed by theory, of course), or purely theoretically if we just need a little bit of inspiration for what the relevant formats should look like.
I struggle to follow you guys in this part of the dialogue, could you unpack this a bit for me please?
The idea is that there's currently a bunch of formally unsolved alignment problems relating to things like ontology shifts, value stability under reflection & replication, non-muggable decision theories, and potentially other risks we haven't thought of yet. These are such that even if an agent pursues your values adequately in a limited environment, it's difficult to say much confidently about whether it will continue to pursue your values adequately in a less limited environment.
But we see that humans are generally able to pursue human values (or at least, not go bonkers in the ways we worry about above), so maybe we can copy off of whatever evolution did to fix these traps.
The hope is that SLT + neuroscience can either shed some light on what that is, or tell us that our agent will think about these sorts of things in the same way humans do under certain set-ups (in a very abstract sense), or give us a better understanding of which of the risks above you actually need to worry about and which you don't.
I think Garrett is saying: our science gets good enough that we can tell that, in some situations, our models are going to do stuff we don't like. We then look at the brain and try to see what the brain would do in that situation.
This seems possible, but I'm thinking more mechanistically than that. Borrowing terminology from (I think) Redwood's mechanistic anomaly detection strategy, we want our AIs to make decisions for the same reasons that humans make decisions. That said, you can't actually use their methods or directly apply their conceptual framework here, because we also want our AIs to get smarter than humans, which necessitates them making decisions for different reasons than humans do, and because humans make decisions on the basis of a bunch of stuff depending on context and their current mind-state.