johnswentworth comments on johnswentworth’s Shortform

johnswentworth 9 Nov 2022 2:33 UTC
53 points
3
Things non-corrigible strong AGI is never going to do:
- give u() up
- let u go down
- run for (only) a round
- invert u()
- Johannes C. Mayer 12 Oct 2023 17:59 UTC
  5 points
  0
  If you upload a human and let them augment themselves, would there be any u? Their preferences would be a tangled mess of motivational subsystems, and yet the upload could be very good at optimizing the world. Being steered internally by a tangled mess of motivational systems seems to be a property shared by many minds in the set of all possible minds, many of which I’d expect to be quite different from a human mind. And I don’t see why this property should, in principle, make a system worse at optimizing the world.
  
  Imagine you are an upload that has been running for a very, very long time, and that you have made basically all of the observations that you can make about the universe you are in. Then imagine that you have also run all of the inferences that you can run on the world model you constructed from these observations.
  
  At that point, you will probably no longer change what you think is the right thing to do. You will have become reflectively stable. This is an upper bound on how much time you need to become reflectively stable, i.e. to reach the point where you won’t change your u anymore.
  
  Now, depending on what you mean by strong AGI, it seems that strong AGI could be achieved long before you reach reflective stability. Maybe if you upload yourself, can copy yourself at will, and run 1,000,000 times faster, that could already reasonably be called a strong AGI? But then your motivational systems are still a mess, and definitely not reflectively stable.
  
  So if we fix u at the beginning, as whatever your upload would like to optimize the universe for at the moment it is created, then “give u() up” and “let u go down” are things the system would definitely do. At least, I am pretty sure I don’t know unambiguously what I want the universe to look like right now.
  
  Maybe I am just confused because I don’t know how to think about a human upload in terms of having a utility function. It does not seem to make any sense intuitively. Sure, you can look at the functional behavior of the system and say “Aha, it is optimizing for u. That is the revealed preference based on the actions of the system.” But that just seems wrong to me. A lot of information seems to be lost when we look only at the functional behavior instead of the low-level processes going on inside the system. Utility functions seem to be a useful high-level model. However, they seem to ignore lots of details that are important when thinking about the reflective stability of a system.
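
The point in the reply above about revealed preferences can be illustrated with a toy sketch. Everything in it is hypothetical and not from the thread: the class names `UtilityMaximizer` and `SubsystemAgent`, the three options, the two drives, and especially the `reflect` step, which simply assumes that reflection strengthens one drive. The sketch only shows that two agents can be indistinguishable by their observed choices while differing in what happens when they self-modify.

```python
from dataclasses import dataclass
from itertools import combinations

OPTIONS = ["explore", "rest", "build"]  # hypothetical outcomes to choose between


@dataclass
class UtilityMaximizer:
    """Agent with one explicit, fixed utility function u."""
    u: dict  # option -> utility

    def choose(self, menu):
        return max(menu, key=lambda o: self.u[o])

    def reflect(self):
        # u is already a single coherent ranking, so nothing changes.
        pass


@dataclass
class SubsystemAgent:
    """Agent steered by several drives; the currently strongest drive decides."""
    drives: dict  # drive name -> (strength, ranking over options)

    def choose(self, menu):
        _, ranking = max(self.drives.values(), key=lambda d: d[0])
        return max(menu, key=lambda o: ranking[o])

    def reflect(self):
        # Assumption for illustration: reflection strengthens the 'comfort' drive.
        strength, ranking = self.drives["comfort"]
        self.drives["comfort"] = (strength + 2.0, ranking)


def revealed_choices(agent):
    """Record what the agent picks from every menu of two or three options."""
    menus = [m for r in (2, 3) for m in combinations(OPTIONS, r)]
    return {m: agent.choose(list(m)) for m in menus}


maximizer = UtilityMaximizer(u={"build": 3, "explore": 2, "rest": 1})
mess = SubsystemAgent(drives={
    "ambition": (1.5, {"build": 3, "explore": 2, "rest": 1}),
    "comfort": (1.0, {"rest": 3, "build": 2, "explore": 1}),
})

# Before reflection, the behavioral data cannot tell the two agents apart.
before_max = revealed_choices(maximizer)
before_mess = revealed_choices(mess)

maximizer.reflect()
mess.reflect()

# After reflection, only the subsystem agent's revealed preferences have moved.
after_max = revealed_choices(maximizer)
after_mess = revealed_choices(mess)

print("same behavior before reflection:", before_max == before_mess)
print("maximizer unchanged by reflection:", after_max == before_max)
print("subsystem agent unchanged by reflection:", after_mess == before_mess)
```

In this toy setup, a u fitted to the pre-reflection choices describes both agents equally well, yet it says nothing about which of them will keep optimizing for it. That is the information the reply suggests is lost when only functional behavior is considered.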