Is it possible to ensure an AGI effectively acts according to a bounded utility function, with “do nothing” always a safe/decent option?
The goal would be to increase risk aversion to the point where practical external deterrence is enough to keep that AGI from killing us all.
Maybe some more hardcoding or hand engineering in the designs?
Maybe, but we don’t have a particularly good understanding of how we would do that. This is sometimes termed “strawberry alignment”. Also, again, you have to figure out how to use “strawberry alignment” to solve the problem that someone is eventually going to do “not strawberry alignment”.
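To make the proposal in the question a bit more concrete, here is a minimal toy sketch, not anything from the discussion itself: all function names, penalty weights, and payoff numbers are hypothetical. It shows one way a bounded, risk-averse utility with a guaranteed “do nothing” option can make the safe default beat a high-variance plan.

```python
import math

# Hypothetical toy sketch (names and numbers are illustrative only):
# the agent scores each action with a bounded, risk-averse utility, and
# "do nothing" always has a known modest payoff, so uncertain high-stakes
# actions lose to the safe default.

def bounded_utility(raw_value: float, scale: float = 10.0) -> float:
    """Squash unbounded payoff estimates into (-1, 1) via tanh, so no single
    outcome can dominate the decision by sheer magnitude."""
    return math.tanh(raw_value / scale)

def risk_averse_score(outcomes, risk_penalty: float = 2.0) -> float:
    """Expected bounded utility minus a penalty proportional to its variance.
    `outcomes` is a list of (probability, raw_payoff) pairs."""
    utils = [(p, bounded_utility(v)) for p, v in outcomes]
    mean = sum(p * u for p, u in utils)
    var = sum(p * (u - mean) ** 2 for p, u in utils)
    return mean - risk_penalty * var

# Candidate actions, each a small outcome distribution of (probability, raw payoff).
actions = {
    "do_nothing": [(1.0, 1.0)],                       # certain, modest payoff
    "aggressive_plan": [(0.6, 100.0), (0.4, -100.0)], # huge upside, huge downside
}

best = max(actions, key=lambda a: risk_averse_score(actions[a]))
print(best)  # with these illustrative numbers, the safe default wins
```

The sketch only illustrates the decision rule the question is gesturing at; it says nothing about whether an actual AGI could be built so that its effective behavior matches such a scored utility, which is the part the answer is pessimistic about.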