But basically, by “simple goals” I mean “goals which are simple to represent”, i.e. goals which have highly compressed representations.
It seems to me you are using “compressed” in two very different meanings in parts 1 and 2. Or, to be fairer, I interpret the meanings very differently.
Let me try to make my view of things more concrete to explain:
Compressed representations: A representation is a function f: O → R from observations of the world state O (or sequences of such observations) into a representation space R of “features”. Calling this representation “compressed” means (a) that in R only a small number of features are active at any given time, and (b) that this small number of features is still sufficient to predict and act in the world.
Goals building on compressed representations: A goal is a (maybe linear) function U: R → ℝ from the representation space into the real numbers. The goal “likes” some features and “dislikes” others. (Or, if it is not entirely linear, it may like/dislike some simple combinations or compositions of features.)
It seems to me that in part 2 of your post, you view goals as compositions U∘f: O → ℝ. Part 1 says that f is highly compressed. But it’s totally unclear to me why the composition U∘f should then have the simplicity properties you claim in part 2, which in my mind don’t connect with the compression properties of f as I just defined them.
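To make this concrete, here is a minimal toy sketch of the picture I have in mind (everything in it, the dimensions, the top-k sparsification, the random weights, is an illustrative assumption of mine, not something from the post): a representation f whose output is sparse, a linear goal U over its features, and their composition U∘f.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and parameters (purely illustrative choices)
obs_dim, hidden_dim, feat_dim = 100, 64, 32
W1 = rng.normal(size=(hidden_dim, obs_dim))   # encoder parameters
W2 = rng.normal(size=(feat_dim, hidden_dim))
goal_weights = rng.normal(size=feat_dim)      # weights of the linear goal U

def f(observation):
    """Toy representation f: O → R.
    Returns a feature vector in which only k entries are non-zero,
    i.e. 'compressed' in sense (a) above."""
    hidden = np.tanh(W1 @ observation)
    features = np.maximum(0.0, W2 @ hidden)
    k = 5
    sparse = np.zeros_like(features)
    top = np.argsort(features)[-k:]           # keep only the k largest activations
    sparse[top] = features[top]
    return sparse

def U(features):
    """Toy goal U: R → ℝ: a linear function that 'likes' features with
    positive weights and 'dislikes' features with negative weights."""
    return float(goal_weights @ features)

obs = rng.normal(size=obs_dim)
print("active features:", int(np.count_nonzero(f(obs))))  # a handful, not all 32
print("goal value U(f(o)):", U(f(obs)))                    # the composition U∘f
```

The point of the sketch is just that “compressed” in my sense is a statement about how many features are active per input, not about how many parameters f itself has.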
A few more thoughts:
The notion of “simplicity” in part 2 seems to be about how easy it is to represent a function, i.e., about how simple the space of parameters is with which the function U∘f is represented.
The notion of “compression” in part 1 seems to be about how easy it is to represent an input, i.e., about whether there is a small number of features whose activation tells you the important things about the input.
These notions of simplicity and compression are very different. Indeed, if you have a highly compressed representation f as in part 1, I’d guess that f necessarily lives in a highly complex space of possible functions with many parameters, which is the opposite of what seems to be going on in part 2.
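To put a rough number on that gap (using the same assumed toy dimensions as in the sketch above, so this is only illustrative): the encoder f already needs thousands of weights, while the linear goal U on top of it needs only one weight per feature.

```python
# Rough parameter count for the toy sketch above (dimensions are assumed, not from the post)
obs_dim, hidden_dim, feat_dim = 100, 64, 32
params_f = obs_dim * hidden_dim + hidden_dim * feat_dim  # 6400 + 2048 = 8448 weights for the encoder f
params_U = feat_dim                                      # 32 weights for the linear goal U
print(params_f, params_U)  # 8448 32
```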
This is largely my fault since I haven’t really defined “representation” very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing “fur”, “mouth”, “nose”, “barks”, etc. Otherwise if we just count “dog” as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn’t seem like a useful definition.
(To put it another way: the representation is the information you need to actually do stuff with the concept.)
I’m confused. Most of the time, when seeing a dog, all I really need to know is that it is a “dog”, so this is totally sufficient to do something with the concept. E.g., if I’m walking down the street and wonder “will this thing bark?”, then knowing “my dog neuron activates” is almost enough.
I’m confused for a second reason: It seems like here you want to claim that the “dog” representation is NOT simple (since it contains “fur”, “mouth”, etc.). However, forming the “dog” representation requires a lot of intelligence and should thus come along with compression, and if you equate compression and simplicity, then it seems to me that you’re not being consistent. (I feel a bit awkward saying “you’re not consistent”, but I think it’s probably good if I state my honest state of mind at this moment.)
To clarify my own position, in line with my definition of compression further above: I think that whether a representation is simple/compressed is NOT a property of a single input-output relation (like “the pixels of a dog get mapped to the dog neuron being activated”), but instead a property of the whole FUNCTION that maps inputs to representations. This function is compressed if, for any given input, only a small number of neurons in the last layer activate, and if these can be used (ideally in a linear way) for further predictions and for evaluating goal achievement.
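A minimal sketch of how I would operationalise that, assuming something like the toy f from my earlier sketch (the name average_active_fraction and the use of “non-zero” as the activity criterion are my own illustrative choices): compression is measured as an average over many inputs, not read off a single input-output pair.

```python
import numpy as np

def average_active_fraction(representation_fn, observations):
    """Average fraction of last-layer features that are non-zero across inputs.
    A representation function that is 'compressed' in my sense should score low
    here while still supporting (ideally linear) prediction and goal evaluation."""
    fractions = []
    for o in observations:
        r = representation_fn(o)
        fractions.append(np.count_nonzero(r) / r.size)
    return float(np.mean(fractions))

# e.g. with the toy f and obs_dim from the earlier sketch:
# rng = np.random.default_rng(0)
# print(average_active_fraction(f, [rng.normal(size=obs_dim) for _ in range(100)]))
```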
I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it’s very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.
Okay, I agree with this, fwiw. :) (Though I may not necessarily agree with claims about how this connects to the rest of the post.)
Thanks for the answer!