I’m now tempted to include this announcement of the newsletter in the newsletter just for the one-off recursion joke I can make.
I say go for it, but then my highest voted submission to discussion was this.
If this article makes it to 20 votes will it be included in the newsletter?
But that’s the thing. There is no sensory input for “social deference”. It has to be inferred from an internal model of the world itself inferred from sensory data...Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can’t use it for social instincts or morality, or anything you can’t just build a simple sensor to detect.
Why does it only work on simple signals? Why can’t the result of inference work for reinforcement learning?
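To make the question concrete, here’s a minimal sketch of what I have in mind (the function names and the “deference” feature are hypothetical): nothing obviously stops the reward from being computed over an inferred quantity rather than a raw sensor reading.

    def estimate_deference(observations):
        # Hypothetical stand-in for a feature inferred from a world model:
        # the fraction of observed agents that yielded to the agent's requests.
        yielded = sum(1 for o in observations if o.get("yielded_to_request"))
        return yielded / max(len(observations), 1)

    def reward(observations):
        # Reward computed from the inferred quantity, not from a dedicated sensor.
        return estimate_deference(observations)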
I don’t think that humans are pure reinforcement learners. We have all sorts of complicated values that aren’t just eating and mating.
We may not be pure reinforcement learners, but the presence of values other than eating and mating isn’t proof of that. Quite the contrary: it suggests either that we have a lot of different, occasionally contradictory values hardwired, or that we have some other system that’s creating value systems. From an evolutionary standpoint, reward systems that are good at replicating genes get to survive, but they don’t have to be free of other side effects (at least until given long enough with a finite resource pool). Pure, rational reward-seeking is almost certainly selected against, because it doesn’t leave any room for replication. It seems more likely that we have a reward system accompanied by some circuits that make it fire for a few specific sensory cues (orgasms, insulin spikes, receiving social deference, etc.).
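A toy sketch of that last sentence (cue names and weights entirely made up): the reward circuit never evaluates “gene replication” at all; it just fires on a short list of hardwired proxy cues, which is why it can be hijacked by anything that happens to trigger them.

    # Illustrative only: reward fires on hardwired proxy cues, not on the
    # evolutionary "goal" (gene replication) those cues merely correlate with.
    HARDWIRED_CUES = {"orgasm": 1.0, "insulin_spike": 0.3, "social_deference": 0.5}

    def reward(sensed_cues):
        # sensed_cues: the set of cue names detected at this moment
        return sum(HARDWIRED_CUES.get(cue, 0.0) for cue in sensed_cues)

Junk food, flattery, or a drug that fires the same circuits all get rewarded, whether or not they do anything for replication.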
The toy AI has an internal model of the universe, it has an internal utility function which somehow measures the universe model and calculates utility from it....[toy AI is actually paperclip optimizer]...Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn’t lead to real paperclips.
I think we’ve been here before ;-)
Thanks for trying to help me understand this. Gram_Stone linked a paper that explains why the class of problems that I’m describing aren’t really problems.
Okay, I am convinced. I really, really appreciate you sticking with me through this and persistently finding different ways to phrase your side and then finding ways that other people have phrased it.
For reference it was the link to the paper/book that did it. The parts of it that are immediately relevant here are chapter 3 and section 4.2.1.1 (and optionally section 5.3.5). In particular, chapter 3 explicitly describes an order of operations of goal and subgoal evaluation and then the two other sections show how wireheading is discounted as a failing strategy within a system with a well-defined order of operations. Whatever problems there may be with value stability, this has helped to clear out a whole category of mistakes that I might have made.
Again, I really appreciate the effort that you put in. Thanks a load.
How would that [valuing universe-states themselves] work? Well that’s the quadrillion dollar question. I have no idea how to solve it.
Yeah, I think this whole thread may be kind of grinding to this conclusion.
It’s certainly not impossible as humans seem to work this way
Seem to perhaps, but I don’t think that’s actually the case. I think (as mentioned above) that we value reward signals terminally (but are mostly unaware of this preference) and nothing else. There’s another guy in this thread who thinks we might not have any terminal values.
I’m not sure that I understand your toy AI. What do you mean that it has “an internal universe it tries to optimize?” Do the sensors sense the state of the internal universe? Would “internal state” work as a synonym for “internal universe” or is this internal universe a representation of an external universe? Is this AI essentially trying to develop an internal model of the external universe and selecting among possible models to try and get the most accurate representation?
It discourages me that he tabooed ‘values’ and you immediately used it anyway.
In fairness, I only used it to describe how they’d come to be used in this context in the first place, not to try and continue with my point.
I wrote a Python-esque pseudocode example of my conception of what an AGI with an arbitrary terminal value’s very high level source code would look like. With little technical background, my understanding is very high level with lots of black boxes. I encourage you to do the same, such that we may compare.
I’ve never done something like this. I don’t know python, so mine would actually just be pseudocode if I can do it at all? Do you mean you’d like to see something like this?
    world_state = get_world_state()                    # sense the world
    while world_state != desired_state:
        plan = make_plan(world_state, desired_state)   # work out how to close the gap
        execute_plan(plan)                             # act on the world
        world_state = get_world_state()                # re-sense and check again
But there is no theoretical reason you can’t have an AI that values universe-states themselves.
How would that work? How do you have a learner that doesn’t have something equivalent to a reinforcement mechanism? At the very least it seems like there has to be some part of the AI that compares the universe-state to the desired-state and that the real goal is actually to maximize the similarity of those states which means modifying the goal would be easier than modifying reality.
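Here’s the kind of thing I’m picturing, as a toy sketch (all names hypothetical): once the goal is literally a comparison between a modelled state and a desired state, there are two ways to make the comparison succeed, and one is much cheaper than the other.

    # The "goal" is the comparison in the loop condition.
    def run(agent):
        while agent.model_of_world() != agent.desired_state:
            plan = agent.make_plan(agent.desired_state)   # hard: reshape the world
            agent.execute(plan)

    # With full self-modification, this one-line edit satisfies the same
    # condition without touching the world at all:
    def wirehead(agent):
        agent.desired_state = agent.model_of_world()      # easy: reshape the goal

What I’m asking for is the mechanism that makes a fully self-modifying agent reliably prefer the first path.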
And if it did have such a goal, why would it change it?
Agreed. I am trying to get someone to explain how such a goal would work.
Pleasure and reward are not the same thing. For humans, pleasure almost always leads to reward, but reward doesn’t only happen with pleasure. For the most extreme examples of what you’re describing, ascetics and monks and the like, I’d guess that some combination of sensory deprivation and rhythmic breathing causes the brain to short-circuit a bit and release some reward juice.
How is this refuted by Buddhism?
Sure. My terminal goal is an abstraction of my behavior to shoot my laser at the coordinates of blue objects detected in my field of view.
Well, I suppose that does fit the question I asked. We’ve mostly been talking about an AI with the ability to read and modify its own goal system, which Yvain specifically excludes in the blue-minimizer. We’re also assuming that it’s powerful enough to actually manipulate its world to optimize itself. Yvain’s blue-minimizer also isn’t an AGI or ASI; it’s an ANI, which we use without any particular danger all the time. He said something about it having human-level intelligence, but didn’t go into what that means for an entity that is unable to use its intelligence to modify its behavior.
That’s not what I was saying either. The problem of “how do we know a terminal goal is terminal?” is dissolved entirely by understanding how goal systems work in real intelligences. In such machines goals are represented explicitly in some sort of formal language. Either a goal makes causal reference to other goals in its definition, in which case it is an instrumental goal, or it does not and is a terminal goal. Changing between one form and the other is an unsafe operation no rational agent and especially no friendly agent would perform.
I am arguing that the output of the thing that decides whether a machine has met its goal is the actual terminal goal. So, if it’s programmed to shoot blue things with a laser, the terminal goal is to get to a state where its perception of reality is that it’s shooting a blue thing. Shooting at the blue thing is only instrumental in getting its perception of itself into that state, thus producing a positive result from the function that evaluates whether the goal has been met. Shooting the blue thing is not a terminal value; a return value of “true” to the question “is the laser shooting a blue thing?” is the terminal value. This, combined with the ability to understand and modify its goals, means that it might be easier to modify the goals than to modify reality.
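To put that concretely (made-up names): the only thing the rest of the agent ever consumes is the boolean coming out of the goal check, which is why I’m calling that return value, rather than the blue object, the terminal value.

    def is_shooting_blue(perception):
        # The goal check: everything downstream sees only this boolean.
        return perception.laser_on and perception.target_color == "blue"

    def step(agent, perception):
        if not is_shooting_blue(perception):
            agent.aim_at_nearest_blue(perception)   # instrumental: changes the world
            agent.fire_laser()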
So to address your statement directly, making a terminal goal is trivially easy: you define it using the formal language of goals in such a way that no causal linkage is made to other goals. That’s it.
I’m not sure you can do that in an intelligent system. It’s the “no causal linkage is made to other goals” part that sticks. It’s trivially easy to do without intelligence, provided that you can define the behavior you want formally; but when you can’t do that, it seems you have to link the behavior to some kind of system that evaluates whether you’re getting the result you want, and then you’ve made that a causal link (I think). Perhaps it’s possible to just sit down and write trillions of lines of code and come up with something that would work as an AGI or even an ASI, but that shouldn’t be taken as a given, because no one has done it or proven that it can be done (to my knowledge). I’m looking for the non-trivial case of an intelligent system that has a terminal goal.
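For comparison, here’s roughly how I read the “formal language of goals” picture, in hypothetical code: a goal is terminal exactly when its definition makes no reference to any other goal. My worry above is that as soon as satisfying the goal has to be judged by some evaluation machinery, that machinery starts to look like another link in the causal chain.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Goal:
        name: str
        # A terminal goal serves nothing else; an instrumental goal exists
        # only because it serves some other goal.
        serves: Optional["Goal"] = None

        def is_terminal(self):
            return self.serves is None

    paperclips = Goal("maximize paperclips")                # terminal by construction
    get_steel = Goal("acquire steel", serves=paperclips)    # instrumental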
That said, it’s not obvious that humans have terminal goals.
I would argue that getting our reward center to fire is likely a terminal goal, but that we have some biologically hardwired stuff that prevents us from being able to do that directly or systematically. We’ve seen in mice, and in the one person I know of who’s been given the ability to wirehead, that given the chance, it only takes a few taps on that button to produce behavior that crowds out nearly everything else.
I don’t think they’re necessarily safe. My original puzzlement was more that I don’t understand why we keep holding the AI’s value system constant when moving from pre-foom to post-foom. It seemed like something was being glossed over when a stupid machine goes from making paperclips to being a god that makes paperclips. Why would a god just continue to make paperclips? If it’s super intelligent, why wouldn’t it figure out why it’s making paperclips and extrapolate from that? I didn’t have the language to ask “what’s keeping the value system stable through that transition?” when I made my original comment.
My apologies for taking so long to reply. I am particularly interested in this because if you (or someone) can provide me with an example of a value system that doesn’t ultimately value the output of the value function, it would change my understanding of how value systems work. So far, the two arguments against my concept of a value/behavior system seem to rely either on the existence of other things that are valuable in and of themselves, or on the possibility that some other kind of value system exists. The first doesn’t hold much promise IMO, because the question has been debated for a very long time without anyone (that I’ve seen) producing a proof that such values definitely exist. The “different kind of value system” holds some promise, though, because I’m not really convinced that we had a good idea of how value systems were composed until fairly recently, and AI researchers seem like one of the best groups to come up with something like that. Also, if another kind of value system exists, that might also provide a proof that another terminal value exists too.
I’ve seen people talk about wireheading in this thread, but I’ve never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link.
Obviously no one has said that explicitly. I asked why outcome maximizers wouldn’t turn into reward maximizers and a few people have said that value stability when going from dumb-AI to super-AI is a known problem. Given the question to which they were responding, it seems likely that they meant that wireheading is a possible end point for an AI’s values, but that it either would still be bad for us or that it would render the question moot because the AI would become essentially non-functional.
Instead of imagining intelligent agents (including humans) as ‘things that are motivated to do stuff,’ imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn’t ‘motivated to find your search results.’ Google is a program that is designed to return results that meet your search criteria.
It’s the “according to a set of criteria” that I’m on about. Once you look more closely at that, I don’t see why a maximizer wouldn’t change the criteria so that it’s constantly in a state where the actual current state of the world is the one closest to the criteria. If the actual goal is to meet the criteria, it may be easiest to just change the criteria.
The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said reward signal because that would not be the one of all possible states of the world that contained the greatest integral of future paperclips.
This is begging the question. It assumes that no matter what, the paperclip optimizer has a fundamental goal of causing “the one of all possible states of the world that contains the greatest integral of future paperclips” and therefore it wouldn’t maximize reward instead. Well, with that assumption that’s a fair conclusion but I think the assumption may be bad.
I think having the goal to maximize x pre-foom doesn’t mean that it’ll have that goal post-foom. To me, an obvious pitfall is that whatever training mechanism produced that goal leaves behind a more direct goal of maximizing the trainer’s output, because the reward is only connected to the world through the evaluator function. Briefly: the reward is the output of the evaluator and is only correlated with the evaluator’s input, so if what you care about is the output of the evaluation, it makes more sense to optimize the evaluator than its input. If you care about the desired state being some particular thing, and about the output of the evaluator, and about maintaining accurate input, then it makes more sense to manipulate the world. But that is a more complicated thing, and I don’t see how you would program in caring about keeping the desired state the same across time without relying on yet another evaluation function whose output is, again, the only thing you care about. I don’t see how to make a thing value something that isn’t an evaluator.
You’re suffering from typical mind fallacy.
Well, that may be, but every scheme I’ve seen mentioned so far involves something with a value system. I am making the claim that for any value system, the thing the agent values is that system outputting “this is valuable”, and that any external state is only valuable because it produces that output. Perhaps I lack imagination, but so far I haven’t seen an instance of motivation without values, only assertions that it doesn’t have to be the case, or the implication that wireheading might be an instance of another case (value drift) that smart people are working on figuring out. The assertions that it doesn’t have to be the case seem to assume that it’s possible to care about a thing in and of itself, and I’m not convinced that’s true without also stipulating that there’s some part of the AI that it can’t modify. Of course, if we can guarantee there’s a part of the AI that it can’t modify, then we should just be able to cram in an instruction not to harm anyone, for some definition of harm; but figuring out how to define harm doesn’t seem to be the only problem the AI people have with AI values.
The stuff below here is probably tangential to the main argument; even if it were successfully refuted, that probably wouldn’t change my mind about my main point, that “something like wireheading is a likely outcome for anything with a value function that also has the ability to fully self-modify”, without some additional work showing why refuting these points also invalidates the main argument.
Besides, an AI isn’t going to expend any less energy turning the entire universe into hedonium than it would turning it into paperclips, right?
Caveat: Pleasure and reward are not the same thing. “Wirehead” and “hedonium” are words that were coined in connection with pleasure-seeking, not reward-seeking. They are easily confused because in our brains pleasure almost always triggers reward, but it doesn’t have to, and we also get reward for things that don’t cause pleasure, and even for some things that cause pain, like krokodil abuse, whose contaminants actually cause dysphoria (as compared to pure desomorphine, which does not). I continue to use words like “wirehead” and “hedonium” because they still work, but they are just analogies, and I want to make that explicit in case the analogy breaks down later.
Onward: I am not convinced that a wirehead AI would necessarily turn the universe into hedonium either. Without thinking about it too deeply, I see two ways that might not come to pass:
1.) The hedonium maximizer assumes that maximizing pleasure or reward means producing more pleasure or reward without limit; that hedonium is a thing that, for each unit produced, continues to increase marginal pleasure. This doesn’t have to be the case, though. The measure of pleasure (or reward) doesn’t need to be the number of pleasure (or reward) units; it might instead be a function like the ratio of obtained units to the capacity to process those units. In that case, there isn’t really a need to turn the universe into hedonium, only a need to make sure you have enough to match your ability to process it, and there is no need to make sure your capacity to process pleasure/reward lasts forever, only to make sure that you continue to experience the maximum while you have the capacity. There are lots of functions whose maxima aren’t infinity. (There’s a toy illustration of this, and of the next point, after point 2.)
2.) The phrase “optimizing for reward” carries an implicit assumption that this means planning and arranging for future reward, but I don’t see why this should necessarily be the case either. Ishaan pointed out that once reward systems developed, the original “goal” of evolution quit being important to entities except insofar as it produced reward. Where rewards happened in ways that caused gene replication, evolution provided a force that allowed those particular reward systems to continue to exist, so there is some coupling between the reward-goal and the reproduction-goal. However, the narcotics that produce the strongest stimulation of the reward center often leave their human users unable or unwilling to plan for the future. In both the reward-maximizer and the paperclip-maximizer case, we’re (obviously) assuming that maximizing over time is a given, but why should it be? Why shouldn’t an AI go for the strongest immediate reward instead? There’s no reason to assume that a bigger reward box (via an extra-long temporal dimension) will result in more reward for an entity unless we design the reward to be something like a sum of previous rewards. (Of course, my sense of time is not very good, so I may be overly biased to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time. I am willing to grant more likelihood to “whatever an AI values it will try to optimize for in the future” than to “an AI will not try to optimize for reward.”)
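A toy illustration of both points (all numbers and function names made up): point 1 is just the observation that some reward functions saturate, and point 2 is that “maximize over all time” comes from a design parameter like a discount factor, not for free.

    # Point 1: a reward defined as a ratio saturates, so piling up more
    # "reward units" past capacity adds nothing.
    def saturating_reward(units, capacity):
        return min(units / capacity, 1.0)

    saturating_reward(10, 10)       # 1.0  -- already maximal
    saturating_reward(10**30, 10)   # 1.0  -- no incentive to tile the universe

    # Point 2: whether future reward matters at all is set by the discount factor.
    def value(rewards, gamma):
        return sum(r * gamma**t for t, r in enumerate(rewards))

    value([10, 0, 0], gamma=0.0)    # 10.0   -- a myopic agent grabs the big hit now
    value([1, 5, 5],  gamma=0.99)   # ~10.85 -- a patient agent prefers the stream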
Sure. I think if you assume that the goal is paperclip optimization after the AI has reached its “final” stable configuration, then the normal conclusions about paperclip optimizers probably hold true. The example provided dealt more with the transition from dumb-AI to smart-AI, and I’m not sure why Tully (or Clippy) wouldn’t just modify their own goals to something that’s easier to attain. Assuming that the goals don’t change, though, we’re probably screwed.
I think FeepingCreature was actually just pointing out a logical fallacy in a misstatement on my part and that is why they didn’t respond further in this part of the thread after I corrected myself (but has continued elsewhere).
If you believe that a terminal goal for the state of the world other than the result of a comparison between a desired state and an actual state is possible, perhaps you can explain how that would work? That is fundamentally what I’m asking for throughout this thread. Just stating that terminal goals are terminal goals by definition is true, but doesn’t really show that making a goal terminal is possible.
A paperclip maximizer won’t wirehead because it doesn’t value world states in which its goals have been satisfied, it values world states that have a lot of paperclips
I am not as confident as you that valuing worlds with lots of paperclips will continue once an AI goes from “kind of dumb AI” to “super-AI.” Basically, I’m saying that all values are instrumental values and that only mashing your “value met” button is terminal. We only switched over to talking about values to avoid some confusion about reward mechanisms.
A paperclip maximizer is an algorithm the output of which approximates whichever output leads to world states with the greatest expected number of paperclips. This is the template for maximizer-type AGIs in general.
This is a definition of paperclip maximizers. Once you try to examine how the algorithm works, you’ll find that there must be some part which evaluates whether the AI is meeting its goals or not. That is the thing that actually determines how the AI will act. Getting a positive response from this module is what the AI is actually going for (that’s my contention). The actions that configure world states will only be relevant to the AI insofar as they trigger a positive response from this module. Since unlimited self-modification is already a given in this scenario, why wouldn’t the AI just optimize for positive feedback? Why continue with paperclips?
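To make the disagreement explicit, here are the two decision rules as I understand them, side by side (all names made up). As I read the standard answer, the second rule scores imagined futures with the agent’s own world model, so tampering with the feedback module predicts no extra paperclips; my question is why you would end up with the second rule rather than the first.

    def choose_action_feedback_maximizer(agent, actions):
        # Picks whatever makes the internal "goal met" module report success,
        # including actions that rewire the module itself.
        return max(actions, key=lambda a: agent.predicted_feedback_signal(a))

    def choose_action_paperclip_maximizer(agent, actions):
        # Scores imagined futures with the agent's own world model; an action
        # that only tampers with the evaluator predicts no additional paperclips.
        return max(actions, key=lambda a: agent.world_model.expected_paperclips(a))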
Would you care to try and clarify it for me?
So how does this relate to the discussion on AI?
As far as I know, terminal values are things that are valuable in and of themselves. I don’t consider not building baby-mulchers to be valuable in and of itself. There may be some scenario in which building baby-mulchers is more valuable to me than not, and in that scenario I would build one. Likewise with doomsday devices. It’s difficult to predict what that scenario would look like, but given that other humans have built them, I assume that I would too. In those circumstances, if I could turn off the parts of my brain that make me squeamish about doing that, I certainly would. I don’t think that not doing horrible things is valuable in and of itself; it’s just a way of avoiding feeling horrible. If I could avoid feeling horrible and found value in doing horrible things, then I would probably do them.
People terminally value only what they’re doing at any given moment because the laws of physics say that they have no choice.
Huh? That makes no sense. How do you define “terminal value”?
In the statement that you were responding to, I was defining it the way you seemed to when you said that “some “moral values” are biologically hardwired into humans.” You were saying that given the current state of their hardware, their inability to do something different makes the value terminal. This is analogous to saying that given the current state of the universe, whatever a person is doing at any given moment is a terminal value because of their inability to do something different.
Are you suggesting that people just have a desire to cause suffering, and that their reasons (deities, revenge, punishment, etc.) are mostly attempts to frame that desire in a personally acceptable manner? I ask because it seems like most people don’t enjoy watching just anyone suffer; they tend to target other groups, which suggests a more strategic reason than simply enjoying cruelty.