Somewhere in my brain there is some sort of physical encoding of my values.
Not sure if this is the intended meaning, but the claim that values don’t depend on the content of the world outside the brain is quite popular (especially in decision theory), and there seems to be no basis for it. Brains are certainly some sort of pointers to value, but a lot (or at least certainly some) of the content of values could be somewhere else, most likely in civilization’s culture.
This is an important distinction for corrigibility, because the claim is certainly false for a corrigible agent: it instead wants to find the content of its values in the environment, rather than having that content be part of its current definition/computation. It also doesn’t make sense to talk about such an agent pursuing its goals in a diverse set of environments unless we expect the goals to vary with the environment.
For the decision theory of such agents, this could be a crucial point. For example, an updateless corrigible agent wouldn’t be able to know the goals that it must choose a policy in pursuit of. The mapping from observations to actions that UDT would pick now couldn’t be chosen as the most valuable mapping, because the value/goal itself depends on observations, and even after some observations it isn’t pinned down precisely. So if this point is taken into account, we need a different decision theory, even one that isn’t trying to do anything fancy with corrigibility or mild optimization, but merely acknowledges that goal content could be located in the environment!
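To make that concrete, here is a toy sketch in Python (my own illustration; the observations, actions, and utility functions are made up for the example, not taken from anything above): with a fixed, known utility function, picking the best observation-to-action mapping up front is a well-defined maximization, but when the utility function itself depends on a not-yet-observed environment, there is no single objective to maximize at the updateless stage.

```python
from itertools import product

OBSERVATIONS = ["rainy", "sunny"]
ACTIONS = ["umbrella", "sunscreen"]

# All deterministic policies: mappings from observations to actions.
policies = [dict(zip(OBSERVATIONS, acts))
            for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]

def known_utility(policy):
    # Standard updateless case: a fixed, fully known utility function.
    match = {"rainy": "umbrella", "sunny": "sunscreen"}
    return sum(policy[obs] == match[obs] for obs in OBSERVATIONS)

best_policy = max(policies, key=known_utility)  # well-defined: the objective is fixed up front

# Environment-dependent goals: the utility function is itself determined by an
# environment the agent hasn't observed yet, so at the updateless stage there
# is no single objective to pass to max().
def utility_from_environment(environment):
    return lambda policy: sum(policy[obs] == environment["match"][obs]
                              for obs in OBSERVATIONS)
```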
I mean that the information about what I value exists in my brain. Some of this information consists of pointers to things in the real world. So in a sense the information partly exists in the relation/correlation between me and the world.
I definitely don’t mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.
That’s not what I’m talking about. I’m not talking about what known goals say, what they speak of, or what they consider valuable or important. I’m talking about where the data needed to learn what they are is located, since we start out not knowing the goals at all and need to learn them. There is a particular thing, say a utility function, that is the intended formulation of the goals. It could be the case that this intended utility function can be found somewhere in the brain. That doesn’t mean it’s a utility function that cares about brains; the questions of where it’s found and what it cares about are unrelated.
Or it could be the case that it’s recorded on an external hard drive, and the brain only contains the name of the drive (this name is a “pointer to value”). You simply cannot recover this utility function by looking only at the brain, without actually looking at the drive. So the utility function u itself depends on the environment E, that is, there is some method t of formulating utility functions such that u=t(E). This is not the same as saying that the utility of the environment depends on the environment, which gives the utility value u(E)=t(E)(E) (there’s no typo here). But if the utility function is actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.
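A minimal sketch of the distinction in Python, using the hard drive example (the toy environment, the stored goal, and all names are illustrative assumptions, not anything specified above): t(E) is the step of reading the goal out of the world by following the brain’s pointer, while t(E)(E) is the separate step of evaluating that recovered goal on the world.

```python
# Toy environment E: the brain holds only the drive's name (a pointer to value),
# while the actual goal specification sits on the external drive.
E = {
    "brain": {"pointer_to_value": "drive_01"},
    "drive_01": "count paperclips",
    "paperclips": 3,
}

def t(env):
    """Recover the utility function from the environment: u = t(E).
    This requires following the brain's pointer out to the drive;
    the brain alone does not determine u."""
    drive_name = env["brain"]["pointer_to_value"]
    goal_spec = env[drive_name]
    if goal_spec == "count paperclips":
        return lambda world: world.get("paperclips", 0)
    return lambda world: 0

u = t(E)      # the utility function itself depends on the environment
value = u(E)  # only this second step is the "utility of the environment": t(E)(E)
print(value)  # 3
```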