It can’t discover its safeguards, as it’s eliminated if it breaks one. These are serious, final safeguards!
You could argue that a surviving one would notice that it hadn’t happened to do various things, and would form a sort of anthropic principle: the chance of it never having had to kill a human (or whatever the safeguards cover) is very low, so it could infer that humans have this safeguard system in place and work out from there what the safeguards are. But I think it would be easier to work the safeguards out more directly.
I had misremembered something; I thought that there was a safeguard to ensure that it never tries to learn about its safeguards, rather than a prior making this unlikely.
Perfect safeguards are possible; in an extreme case, we could have a FAI monitoring every aspect of our first AI’s behaviour. Can you give me a specific example of a safeguard so I can find a hole in it? :)