Putting RamblinDash’s point another way: when Eliezer says “unlimited retries”, he’s not talking about a Groundhog Day-style reset. He’s just talking about the mundane thing where, when you’re trying to fix a car engine or something, you try one fix, and if it doesn’t start, you try another fix, and if it still doesn’t start, you try another, and so on. So the scenario Eliezer is imagining is this: we have 50 years. Year 1, we build an AI, and it kills 1 million people. We shut it off. Year 2, we fix the AI. We turn it back on, and it kills another million people. We shut it off, fix it, turn it back on. And so on, until it stops killing people when we turn it on. Eliezer is saying that if we had 50 years to do that, we could align an AI. The problem is that in reality, the first time we turn it on, it doesn’t kill 1 million people; it kills everyone. We only get one try.
I like your phrasing better, but I think it just hides some magic.
In this situation I think we get an AI that repeatedly kills 999,999 people: each fix only rules out the exact failure we already observed, so the optimizer just routes around it. It’s the nearest unblocked path problem.
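To make that concrete, here’s a minimal toy sketch in Python (entirely hypothetical: the plan space, the proxy objective, and the patching loop are all invented for illustration, not anyone’s actual proposal). Each time the AI does something we hate, we add a patch forbidding that exact outcome, and the optimizer simply takes the best-scoring plan the patches don’t cover:

```python
# Toy "nearest unblocked path" dynamic (hypothetical illustration only).
# A plan is summarised by how many people it kills; assume the proxy score
# the optimizer maximizes happens to equal that number (a misaligned proxy).

def best_unblocked_plan(patches):
    """Return the highest-scoring plan that no patch forbids."""
    candidates = (killed for killed in range(1_000_000, -1, -1)
                  if not any(patch(killed) for patch in patches))
    return next(candidates)

patches = []
for attempt in range(3):
    killed = best_unblocked_plan(patches)
    print(f"attempt {attempt}: best unblocked plan kills {killed:,} people")
    # Our "fix": forbid the specific catastrophe we just observed.
    patches.append(lambda k, bad=killed: k >= bad)

# attempt 0: kills 1,000,000 -> we patch it and try again
# attempt 1: kills   999,999 -> patch again
# attempt 2: kills   999,998 -> and so on; the objective never changed
```

Each patch only rules out yesterday’s failure; the objective that produced it is untouched, which is why the loop never converges on ‘kills nobody’.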
The exact reset condition (restart, turn it off and try again, whatever) matters, and nothing works unless that condition is ‘this isn’t going to do something we approve of’, which is precisely the judgement we don’t know how to specify.
The only sense I can make of the idea is ‘If we already had a friendly AI to protect us while we played, we could work out how to build a friendly AI’.
I don’t think we could iterate to a good outcome, even if we had magic powers of iteration.
Your version makes the problem strictly harder than the ‘Groundhog Day with Memories Intact’ version, and I don’t think we could solve even that one.