but the usage here isn’t intended to claim anything that isn’t a direct consequence of the Orthogonality Thesis.
I want to flag that the orthogonality thesis can’t support the assumption that powerful agents are dangerous by default, because it only makes a possibility claim, nothing stronger. I think you need to assume essentially zero prior information to even weakly support the hypothesis that AI is dangerous by default.
I’m just saying that agents capable of exerting large effects on their world are by default going to be dangerous, in the sense that the set of all possible large effects is large compared to the set of desirable large effects.
I think a critical disagreement is probably that I think even weak prior information shifts things toward AI being safe by default, and that we don’t need to specify most of our values explicitly but can instead offload most of that complexity to the learning process.
I think once you accept things which are relatively uncontroversial around here (the orthogonality thesis is true, powerful artificial minds are possible) the burden is then on the person claiming that some method for constructing a powerful mind (e.g. give it incomplete preferences) will not result in that mind being dangerous (or epsilon distance in mind-space from a mind that is dangerous) to show that that’s actually true.
I definitely disagree with that. Accepting the orthogonality thesis plus the claim that powerful artificial minds are possible is not enough to shift the burden of proof unless something else is added, since as stated those premises exclude basically nothing.
For example, on complete preferences, here’s a slightly more precise claim: any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that the agent with complete preferences will often be simpler and more natural in an intuitive sense.
This is not the case under things like invulnerable incomplete preferences, where they managed to weaken the axioms of EU theory enough to get a shutdownable agent:
https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1
I don’t see how this result contradicts my claim. If you can construct an agent with incomplete preferences that follows Dynamic Strong Maximality (DSM), you can just as easily (or more easily) construct an agent with complete preferences that doesn’t need to follow any such rule.
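To make the “often trivial construction” concrete, here is a minimal sketch under illustrative assumptions (a small finite outcome set with made-up outcome names; none of this is from the linked post): complete the strict partial order with a topological sort and read a utility function off the resulting ranking.

```python
from graphlib import TopologicalSorter

# Incomplete (strict partial) preferences over a small, finite outcome set.
# strictly_prefers[a] is the set of outcomes that a is strictly preferred to.
# "coffee" vs "tea" is a preferential gap: neither is ranked above the other.
strictly_prefers = {
    "coffee": {"water"},
    "tea": {"water"},
    "water": set(),
}

# graphlib wants a predecessor map: for each outcome, the set of outcomes that
# must come earlier in the ordering, i.e. everything strictly preferred to it.
preferred_over = {
    b: {a for a, worse in strictly_prefers.items() if b in worse}
    for b in strictly_prefers
}

# Any topological order of this graph is a linear extension of the partial
# order: a complete, transitive ranking that never reverses an existing strict
# preference (Szpilrajn's theorem; for a finite set, topological sort finds one).
complete_ranking = list(TopologicalSorter(preferred_over).static_order())

# Assign utilities by rank and you have an ordinary complete-preferences agent
# that agrees with the original wherever the original had an opinion.
utility = {o: len(complete_ranking) - i for i, o in enumerate(complete_ranking)}

print(complete_ranking)  # e.g. ['coffee', 'tea', 'water']
print(utility)           # e.g. {'coffee': 3, 'tea': 2, 'water': 1}
```

The completed agent never reverses any strict preference the original agent had; it just fills in the gaps, which is why it’s hard to argue the completed version is less capable.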
Also, if DSM works in practice and doesn’t impose any disadvantages on an agent following it, a powerful agent with incomplete preferences following DSM will probably still tend to get what it wants (which may not be what you want).
Constructing a DSM agent seems like a promising avenue if you need the agent to have weird / anti-natural preferences, e.g. total indifference to being shut down. But IIRC, the original shutdown problem was never intended to be a complete solution to the alignment problem, or even a practical subcomponent of one. It was just intended to show that a particular preference that is easy to describe in words and intuitively desirable as a safety property is actually pretty difficult to write down in a way that fits into the various frameworks we have for describing agents and their preferences precisely.
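To illustrate the “easy to describe in words, hard to write down” point with a toy example (the numbers and scenario are invented for illustration, not taken from the shutdown-problem literature): if you try to encode indifference to shutdown inside complete preferences by fixing the utility of the shutdown outcome, the agent is only indifferent at a knife edge, and otherwise has an incentive to pay small costs to cause or prevent shutdown.

```python
# Toy example: encode "indifferent to shutdown" by giving the shutdown outcome
# a fixed utility C, then check whether the agent will pay a tiny cost to
# influence the shutdown button.

def pays_to_block(ev_running: float, shutdown_utility: float, cost: float) -> bool:
    # With complete preferences the comparison always resolves one way, so the
    # agent pays the cost whenever the expected gain from blocking exceeds it.
    return ev_running - shutdown_utility > cost

def pays_to_cause(ev_running: float, shutdown_utility: float, cost: float) -> bool:
    return shutdown_utility - ev_running > cost

C = 5.0       # fixed utility assigned to the shutdown outcome
cost = 0.01   # small cost of interfering with the button

for ev in (4.0, 5.0, 6.0):
    print(f"EV(keep running)={ev}: "
          f"pays to block={pays_to_block(ev, C, cost)}, "
          f"pays to cause={pays_to_cause(ev, C, cost)}")

# EV=6 -> pays to block shutdown; EV=4 -> pays to cause it; only the knife-edge
# EV=5 gives genuine indifference, and any belief update breaks it. A
# preferential gap (incompleteness) is one way to avoid living on that edge.
```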