Then what are the requirements for the framework to be applicable? Many human values, the ones we haven’t self-analysed much, behave like H and its buttons: swayed by random considerations that we’re not sure are value-relevant or not.
I think there are two ways that a reward function can be applicable:
1) For making moral judgements about how you should treat your agent. Probably irrelevant for your button presser unless you’re a panpsychist.
2) If the way your agent works is by predicting the consequences of its actions and attempting to pick an action that maximises some reward (e.g. a chess computer trying to maximise its board valuation function); a minimal sketch of such an agent follows this list. Your agent H as described doesn’t work this way, although, as you note, there are agents which do act this way and produce the same behaviour as your H.
There’s also the kind-of option:
3) Anything can be modelled as if it had a utility function, in the same way that any solar system can be modelled as a geocentric one with enough epicycles. In this case there’s no “true” reward function, just “the reward function that makes the maths I want to do as easy as possible”. Which one that is depends on what you’re trying to do, and maybe pretending there’s a reward function isn’t actually better than using H’s true non-reward-based algorithm.
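To make option 2) concrete, here is a minimal sketch of a consequence-predicting, reward-maximising agent. The names (`choose_action`, `predict`, `reward`) are illustrative, not taken from the original post; assume the caller supplies a world model and a scoring function.

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

def choose_action(
    state: State,
    actions: Iterable[Action],
    predict: Callable[[State, Action], State],  # world model: predicted next state
    reward: Callable[[State], float],           # e.g. a chess board valuation function
) -> Action:
    # Predict the consequence of each candidate action and pick the one
    # whose predicted outcome the reward function scores highest.
    return max(actions, key=lambda action: reward(predict(state, action)))
```

An agent like H, which just follows a fixed rule about which button to press, never consults anything like `reward`, even though (as option 3 notes) some choice of reward function would let an argmax agent of this shape reproduce H’s behaviour exactly.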
My “solution” does use 2), and should be posted in the next few days (maybe on lesswrong 2 only—are you on that?).
This framework lives in the map, not in the territory. It is a model feature, applicable when it makes a model more useful. Specifically, it makes sense when the underlying reality is too complex to deal with directly. Because of that complexity we essentially reduce the dimensionality of the problem by modeling it as a simpler combination of aggregates. “Values” are one kind of such aggregate.
If you have an uncomplicated algorithm with known code, you don’t need such simplifying features.
It is partly in the territory, and comes with the situation where you are modeling yourself. In that situation, the thing will always be “too complex to deal with directly,” regardless of its absolute level of complexity.
Maybe, but that’s not the context in this thread.
Isn’t a big part of the problem the fact that you only have conscious access to a few things? In other words, your actions are determined in many ways by an internal economy that you are ignorant of (e.g. mental energy, physical energy use in the brain, time and space, etc.). These things are in fact value-relevant, but you do not know much about them, so you end up making up reasons why you did what you did.