I’ve spent some time thinking about this and can share some thoughts.
I find the framing around preserving boundaries a bit odd. It seems like a lack of preservation is one way things could be bad, but I don’t think it’s a full accounting of badness (or at least it seems that way based on my understanding of boundaries).
In humans, I strongly suspect that we can model badness as negative valence (@Steven Byrnes’s recent series on valence is a good reference), and that the reason “bad” is a simple and fundamental word in English and most languages is because it’s basic to the way our minds work: bad is approximately stuff we don’t like, and good is stuff we do like, where liking is a function of how much something makes the world the way we want it to be, and wanting is a kind of expectation about future observations.
I also think we can generalize badness from humans and other animals with valence-oriented brains by using the language of control theory. There, we might classify sensor readings as bad if they signal movement away from rather than towards a goal. And since we can model living things as a complex network of layered negative feedback circuits, this suggests that anything is bad if it works against achieving a system’s purpose.
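To make the control-theory framing a bit more concrete, here’s a rough Python sketch (the thermostat setup and all the names are purely illustrative assumptions, not anything from the post): a sensor reading gets classified as “bad” when it signals movement away from the goal, and “good” when it signals movement towards it.

```python
def classify_reading(previous_error: float, current_error: float) -> str:
    """Label a new sensor reading by whether it moves the system towards or away from its goal."""
    if abs(current_error) < abs(previous_error):
        return "good"    # closing in on the setpoint
    if abs(current_error) > abs(previous_error):
        return "bad"     # drifting away from the setpoint
    return "neutral"     # no change

# Toy example: a thermostat with a setpoint of 20 degrees.
setpoint = 20.0
readings = [25.0, 23.0, 24.5, 21.0]        # successive temperature measurements
errors = [r - setpoint for r in readings]  # distance from the goal at each step

for prev, curr in zip(errors, errors[1:]):
    print(classify_reading(prev, curr))
# -> good (25 -> 23), bad (23 -> 24.5), good (24.5 -> 21)
```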
(I have a bit more of my thoughts on this in a draft book chapter, but I was not specifically trying to address this question so you might need to read between the lines a bit.)
Goodness, in these models, is simply the reverse of badness: positive valence things are good, as are sensor readings that signal a goal is being achieved.
There are some interesting caveats around what happens when you get multiple layers in the system that contradict each other, like if smoking a cigarette feels good but we know it’s bad for us, but the basic point stands.
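For what it’s worth, here’s a hedged sketch of that layered-feedback caveat (the layer names and effect numbers are made up for illustration): two control layers evaluate the same action against different goals, and their verdicts can disagree, as with the cigarette.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    goal: str

    def evaluate(self, effects: dict) -> str:
        # A positive effect on this layer's goal reads as "good", a negative one as "bad".
        delta = effects.get(self.goal, 0.0)
        return "good" if delta > 0 else "bad" if delta < 0 else "neutral"

layers = [
    Layer("craving circuit", goal="immediate relief"),
    Layer("long-term planner", goal="health"),
]

# Smoking a cigarette: relief goes up, long-term health goes down.
effects = {"immediate relief": +1.0, "health": -1.0}

for layer in layers:
    print(f"{layer.name}: {layer.evaluate(effects)}")
# -> craving circuit: good, long-term planner: bad
```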
I can’t think of anything else that would be missing from a full specification of badness.
Hello there! This idea might improve your post: I don’t think anyone can properly reason about the problem of badness without thinking about what is “good” at the same time. The core point I’m trying to make is that we should be able to train models in an accurate simulation of our world where both good and evil (badness) exist.
I wrote something about this here if you are interested.