I wrote this for a Discord server. It's a hopefully very precise argument that unaligned intelligence is possible in principle (this was being debated), aimed at aiding early deconfusion about questions like 'what are values, fundamentally?', since there was a lot of implicit confusion about that, including some stemming from moral realist beliefs.
1. There is an algorithm behind intelligent search. Like simpler search processes, this algorithm does not, fundamentally, need to have some specific value about what to search for; if it did, the search process would return the same thing no matter what unrelated question you tried to use it to answer.
2. Now imagine such an algorithm which takes as input a specification (2) of what to search for.
3. You can then combine these with an algorithm which takes as input the output of the search algorithm (1) and does something with it.
For example, if (2) specifies searching for the string of text that, if displayed on a screen, would maximize the amount of x in (1)'s model of the future of the world containing that screen, then (3) can be an algorithm which displays the selected string on the screen, thereby actually maximizing x.
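To make this decomposition concrete, here is a minimal toy sketch in Python. Every name in it is hypothetical, and the brute-force max over a small candidate list is only a stand-in for intelligent search; the point is just that the specification (2) and the effector (3) are inputs to, not intrinsic properties of, the search algorithm (1).

```python
from typing import Callable, Iterable, TypeVar

Candidate = TypeVar("Candidate")


def search(candidates: Iterable[Candidate],
           spec: Callable[[Candidate], float]) -> Candidate:
    """(1) The search algorithm: value-neutral in itself; it maximizes
    whatever scoring function (2) it is handed."""
    return max(candidates, key=spec)


def amount_of_x_if_displayed(text: str) -> float:
    """(2) A specification: scores a string by how much x would exist in
    a model of the future where that string is displayed on the screen.
    (Stubbed here; a real version would involve a world model.)"""
    return float(len(text))  # toy stand-in for a world-model evaluation


def display(text: str) -> None:
    """(3) The effector: does something with the search's output."""
    print(text)


best = search(["a", "bb", "ccc"], spec=amount_of_x_if_displayed)
display(best)  # only this last step gives the search effects on the world
```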
Hopefully this makes the idea of unaligned superintelligence more precise. It would be possible even if moral realism were true (except for versions where the universe itself intervenes on this formally possible algorithm).
(2) is what I might call (if I weren't writing very precisely) the 'value function' of this system.
notes:
- I use ‘algorithm’ in a complexity-neutral way.
- An actual trained neural network would of course be messier, and need not contain anything isomorphic to each of these three components at all.
- This model implies the possibility of an algorithm which intelligently searches for the text that, if displayed on the screen, would maximize x, and then doesn't display it, or does something else with it, not because that other thing is what it 'really values', but simply because that is what the modified algorithm says (see the sketch below). This highlights that the property 'has effects which optimize the world' is not a necessary property of an intelligent or superintelligent system.
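As a toy illustration of that last note, reusing the hypothetical sketch above: swapping out the effector (3) changes nothing about the search (1) or the specification (2), yet the system no longer optimizes the world via the screen.

```python
# Same search (1), same spec (2), different (3): the modified algorithm
# writes the selected string to a file instead of displaying it. The
# 'value function' it searched with is unchanged; only the effector is.
def log_instead(text: str) -> None:
    with open("selected.txt", "w") as f:
        f.write(text)


log_instead(search(["a", "bb", "ccc"], spec=amount_of_x_if_displayed))
```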