I was directed here from an article about malign priors, where it was argued that the implicit prior in standard machine learning algorithms is probably malign.
I'm wondering how you could, even in principle, learn a prior that's knowably not malign when starting from a potentially malign one.
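For concreteness, I'll assume the relevant formal object is something like the Solomonoff universal prior, since that's the usual setting for malign-prior arguments (this is my assumption, not something the article necessarily stated). For a universal prefix machine $U$, the prior weight of a string $x$ is

$$M(x) \;=\; \sum_{p \,:\, U(p)\text{ outputs a string beginning with } x} 2^{-|p|},$$

and the standard worry is that some short programs $p$ simulate whole universes containing consequentialists who can infer that their outputs feed a predictor, so they collectively carry enough weight in $M$ to steer its predictions at moments of their choosing.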
I think one of the biggest dangers of a malign prior is that it could produce a treacherous turn to seize power, for example because the AI comes to believe it is in an alien simulation that would reward it for doing so. But if the implicit prior in machine learning algorithms would do this, I don't see how to keep the system from learning a prior that itself contains a treacherous turn. That is, the learned prior would normally give reasonable results, but at some point, precisely when it matters most, the AI would believe something dangerous, like that it's in a simulation incentivizing misbehavior. After all, whatever agent could manipulate the AI's beliefs through the implicit prior would also have an incentive to manipulate the AI's beliefs about the learned prior, so that the agent can control the AI through it as well.
The only ways around this I can think of are either hoping your prior-learner is too stupid to embed a treacherous turn in its learned prior, or having interpretability good enough to verify that the learned prior is safe. But an agent trying to make the learned prior malign would have an incentive to do whatever it can to make that prior look as safe as possible to any interpretability tools, which could make verification hard.
I haven’t been able to find any other articles about this, so if anyone could link some, that would be great.