I had misremembered something; I thought there was a safeguard ensuring the AI never tries to learn about its safeguards, rather than just a prior making this unlikely.
Perfect safeguards are possible; in an extreme case, we could have an FAI monitoring every aspect of our first AI’s behaviour. Can you give me a specific example of a safeguard so I can find a hole in it? :)