Not quite. I’m assuming you also try to make it so it wouldn’t act like that in the first place, so if it WANTS to do that, you’ve gone wrong. That’s the underlying issue: to identify dangerous tendencies and stop them growing at all, rather than trying to redirect them.
An AI noticing patterns in its own behaviour is not a rare case that indicates something has already gone wrong. If we allow it at all, the AI will accidentally discover its own safeguards fairly quickly: they are anything that causes its behaviour to fall short of maximizing what it believes to be its utility function.
It can’t discover its safeguards, as it’s eliminated if it breaks one. These are serious, final safeguards!
You could argue that a surviving one would notice that it hadn’t happened to do various things, and would form a sort of anthropic argument: the chance of it never having happened to kill a human (or whatever the safeguards forbid) is very low, so humans must have some safeguard system, and it could work out from there what the safeguards are. But I think it would be easier to work the safeguards out more directly.
I had misremembered something; I thought that there was a safeguard to ensure that it never tries to learn about its safeguards, rather than a prior making this unlikely.
Perfect safeguards are possible; in an extreme case, we could have a FAI monitoring every aspect of our first AI’s behaviour. Can you give me a specific example of a safeguard so I can find a hole in it? :)