I very much appreciate the amount of time and effort you’re putting into this!
That said, as much as I’d like to engage with this post, it feels very hard for me to do. The main problem I’m having is that there are a lot of very specific details that I don’t have enough context to evaluate. By “context”, I mean that there are a million different ways one could choose to formalize human values, and I assume that you’ve got some very specific reasons for the particular formalization choices you’ve made. In order to evaluate whether these are good choices, I’d need to understand your goals in making them, but you seem to have given us only the end results of your thought process rather than the goals behind it.
For instance, you note that W_H(v) can be 0 if a human has carefully considered v and found it to be irrelevant or negative. This sentence jumped out at me, since I would intuitively have assumed that if the human had evaluated something as negative, it would be assigned a negative value rather than a 0; at the very least, I wouldn’t have expected values evaluated as irrelevant to be assigned the same score as values evaluated as negative!
Reading on, I found that you separately define an endorsement of v, which can be negative. So apparently, if the human has evaluated something as negative, we can maybe still model that by assigning the thing a positive value and then giving it a negative endorsement? I’m confused as to why these are split into two different variables. “Endorsement” suggests that it’s about meta-values, so that the intent of this separation would be to model things which the human likes but doesn’t actually endorse liking. But that doesn’t capture the possibility that they e.g. dislike pain, and also endorse disliking pain.
Or maybe, since a value v was supposed to be defined as a statement which a human might agree to, we’re supposed to model pain avoidance as a positive claim, “pain is to be avoided”, which is then given a positive value? That would make sense, but in that case I’m again unclear on what the endorsement thing is meant to model, since apparently it doesn’t take things like “liking” into account at all, but rather acts directly on endorsements?
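To make the two readings I’m juggling concrete, here’s a toy sketch of what I have in mind; the field names (weight, endorsement) are my own, not your notation, so this may well misrepresent the formalism:

```python
# Reading 1: the "thing" being valued is pain itself; dislike is a signed
# weight, and endorsement is a separate meta-level stamp on that weight.
reading_1 = {
    "value": "pain",
    "weight": -1.0,       # the human dislikes pain...
    "endorsement": +1.0,  # ...and endorses disliking it
}

# Reading 2: the "thing" being valued is already the claim "pain is to be
# avoided"; agreeing with the claim is a positive weight, and endorsement
# applies to that agreement.
reading_2 = {
    "value": "pain is to be avoided",
    "weight": +1.0,       # the human agrees with the claim...
    "endorsement": +1.0,  # ...and endorses agreeing with it
}
```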
So I mentally tag this as unclear and try to read on, hoping that it will be clarified later in the article. Instead I run into a lot more specific choices and assumptions, and get the feeling that each new section assumes I’ve already understood the previous ones… at which point I gave up.
What would make this much more readable for me would be something like this: each subsection starting with the philosophical motivation and desiderata for the formalization choices made in that section, then having the content that it has now, and then finally giving some practical examples of what these formalizations imply and what kinds of mathematical objects result as a consequence. (Not necessarily always in that order: some mixing might be in order. E.g. for section 1.1, you have the line “Object level values are those which are non-zero only on rewards”; this seems to suggest that there may be values which refer to other values, separately from the value also containing an endorsement of its assigned reward...? So you could have a value that assigns a positive value to some reward and a negative endorsement of that reward, and then a separate value which treats the outcome of the first value as a positive reward with some weight, and which also assigns a positive or negative endorsement to the result of that computation...? I’m probably misunderstanding this somehow, which a bunch of examples of object-level and non-object-level values would clear up; I’ve tried to sketch my current reading below.)
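For whatever it’s worth, here’s a rough sketch of the structure I’m currently imagining; every name in it is my own guess rather than your notation, so treat it as a record of my possible misreading rather than anything authoritative:

```python
# My (possibly mistaken) reading of "object-level values are non-zero only
# on rewards". All names here are my own guesses, not notation from the post.

# Object-level value: non-zero only on a reward; it weights the reward R
# and separately endorses (or anti-endorses) it.
v1 = {
    "weights_on_rewards": {"R": +1.0},       # assigns a positive value to the reward R...
    "endorsements_of_rewards": {"R": -1.0},  # ...while giving R a negative endorsement
    "weights_on_values": {},                 # refers to no other values
}

# Non-object-level value: refers to another value, treating the outcome of
# v1 as a reward with some weight and endorsing (or not) that outcome.
v2 = {
    "weights_on_rewards": {},
    "weights_on_values": {"v1": +0.5},       # treats v1's outcome as a positive reward
    "endorsements_of_values": {"v1": +1.0},  # and endorses the result of that computation
}
```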
Knowing at least what kind of real-world thing the formalism is trying to capture would help a lot when I’m trying to evaluate whether I’ve interpreted something you said correctly.
Thanks!
Ok, I will rework it for improved clarity; but not all the options I chose have deep philosophical justifications. As I said, I was aiming for an adequate resolution, with people’s internal meta-values working as philosophical justifications for their own resolution.
As for the specific case that tripped you up: I wanted to distinguish between endorsing a reward or value, endorsing its negative, and endorsing not having it at all. Think “I want to be thin” vs “I want to be fat” vs “I don’t want to care about my weight”. The first one I track as a positive endorsement of R, the second as a positive endorsement of -R, and the third as a negative endorsement of R (and of -R).
But I’ll work on it more.
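To spell that bookkeeping out concretely, in toy notation (this isn’t the exact representation from the post; I’m taking R to stand for the “being thin” reward here):

```python
# Endorsement bookkeeping for the three cases, with R = "being thin"
# and -R = "being fat" (representation illustrative only).
endorsements = {
    "I want to be thin": {"R": +1},                               # positive endorsement of R
    "I want to be fat": {"-R": +1},                               # positive endorsement of -R
    "I don't want to care about my weight": {"R": -1, "-R": -1},  # negative endorsement of R (and of -R)
}
```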
“not all the options I chose have deep philosophical justifications.”
Thanks!
Just to be clear, when I said that each section would be served by having a philosophical justification, I don’t mean that it would necessarily need to be super-deep; just something like “this seems to make sense because X”, which e.g. sections 2.4 and 2.5 already have.