You seem to be approaching this with a kind of reasoning similar to something I’ve explored in a slightly different way (reducing the risk of false positives in AI alignment mechanism design). You might find it interesting.
Seems a little bit beyond me at 4:45am; I’ll probably take a look tomorrow when I’m less sleep-deprived (although I still can’t guarantee I’ll be able to make it through then; there’s quite a bit of technical language in there that makes my head spin). Are you able to provide a brief tl;dr, and have you thought much about “sign flip in reward function” or “direction of updates to reward model flipped”-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper’s abstract) in general.
Sleep is very important! Get regular sleep every night! Speaking from personal experience, you don’t want to have a sleep-deprivation-induced mental breakdown while thinking about Singularity stuff!
My anxieties over this stuff tend not to be so bad late at night, TBH.
Actually, I’m not sure that sign flips are easier to deal with. A sentiment I’ve heard expressed before is that it’s much easier to trim something so there’s a little more or less of it, but it’s much harder to know whether you’ve got it pointed in the right direction at all. Ultimately, though, addressing false positives ends up being about exactly these kinds of directional issues.
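To make the distinction concrete, here’s a throwaway toy (entirely my own illustration; the outcome names and numbers aren’t from any actual mechanism): a magnitude error in a reward signal still picks out the right outcome, whereas a flipped sign actively optimises for the worst one.

```python
# Toy illustration (mine, not from the paper): a magnitude error in a reward
# signal preserves which outcome an optimiser prefers, while a sign error
# reverses the preference entirely.

outcomes = {"good": 1.0, "neutral": 0.0, "bad": -1.0}

def best(reward_fn):
    """Outcome an optimiser acting on reward_fn would pick."""
    return max(outcomes, key=lambda o: reward_fn(outcomes[o]))

correct   = lambda r: r          # intended reward
too_weak  = lambda r: 0.1 * r    # "a little more or less of it"
sign_flip = lambda r: -r         # pointed in the wrong direction

print(best(correct), best(too_weak), best(sign_flip))
# -> good good bad
```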
it’s much harder to know whether you’ve got it pointed in the right direction at all

Perhaps, but the type of thing I’m describing in the post is more about preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it’s not going to torture everyone if that’s the case).
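Something in the spirit of this toy sketch, maybe (again purely illustrative and my own; the scalar “value_estimate” field and the zero point standing for “nonexistence / nothing of value” are assumptions, not a real reward model):

```python
# Toy sketch (my own illustration): a reward model whose worst case under a
# sign flip is an empty / paperclipped universe rather than a tortured one.
# Assumes a hypothetical scalar "value_estimate" where 0.0 stands for
# "nonexistence / nothing of value", positive for good futures, and negative
# for worse-than-death futures.

def raw_value(outcome: dict) -> float:
    # Stand-in for a learned reward/value model.
    return outcome["value_estimate"]

def safe_reward(outcome: dict) -> float:
    # Clamp everything at or below the nonexistence level to that level, so
    # there is no reward gradient *within* the bad region.
    return max(raw_value(outcome), 0.0)

if __name__ == "__main__":
    futures = {
        "utopia":   {"value_estimate": 10.0},
        "nothing":  {"value_estimate": 0.0},
        "dystopia": {"value_estimate": -10.0},
    }
    for name, o in futures.items():
        # Print the intended reward and what a sign-flipped copy would see.
        print(f"{name:8s} reward={safe_reward(o):5.1f} flipped={-safe_reward(o):5.1f}")
    # Sign-flipped rewards: utopia -10.0, nothing 0.0, dystopia 0.0 -- the
    # flipped objective no longer prefers dystopia over an empty universe.
```

The point of the clamp is just that a sign-flipped copy of this reward is maximised as well by an empty universe as by a dystopian one, so the flipped optimiser has no specific pull toward suffering.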
This seems easier than recognising whether the sign is flipped, or than designing a system that can’t experience these sign-flip-type errors at all; I’m just unsure whether it’s something we have robust solutions for. If it turns out that someone has figured out a reliable solution to this problem, then the only real concern is whether the AI’s developers would bother to implement it. I’d much rather risk the system going wrong and paperclipping than going wrong and turning “I have no mouth, and I must scream” into a reality.