I don’t know if you’re trying to be helpful or clever. You’re basically just restating that you don’t need a reward system to motivate behavior, but not explaining how a system of motivation would work. What motivates seeking correctness or avoiding incorrectness without feedback?
I have felt the same fear that I am wasting my time talking to an extremely clever but disingenuous person. This is certainly no proof, but I assure you that I am not being disingenuous.
You use a lot of the words that people use when they talk about AGI around here. Perhaps you’ve heard of the Orthogonality Thesis?
From Bostrom’s Superintelligence:
The orthogonality thesis
Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
He also defines intelligence for the sake of explicating the aforementioned thesis:
Note that the orthogonality thesis speaks not of rationality or reason, but of intelligence. By “intelligence” we here mean something like skill at prediction, planning, and means–ends reasoning in general.
So, tending to be correct is the very definition of intelligence. Asking “Why are intelligent agents correct as opposed to incorrect?” is like asking “What makes a meter equivalent to the length of the path traveled by light in vacuum during a time interval of 1⁄299,792,458 of a second as opposed to some other length?”
I should also say that I would prefer it if you did not end this conversation out of frustration. I am having difficulty modeling your thoughts and I would like to have more information so that I can improve my model and resolve your confusion, as opposed to you thinking that everyone else is wrong or that you’re wrong and you can’t understand why. Each paraphrase of your thought process increases the probability that I’ll be able to model it and explain why it is incorrect.
Two other people in this thread have pointed out that the value collapse into wireheading or something else is a known and unsolved problem, and that the problems of an intelligence that optimizes for something assume that the AI makes it through this in some unknown way. This suggests that I am not wrong; I’m just asking a question for which no one has an answer yet.
Fundamentally, my position is that given 1.) an AI is motivated by something, 2.) that something is a component (or set of components) within the AI, and 3.) the AI can modify that/those components, it will be easier for the AI to achieve success by modifying the internal criteria for success than by turning the universe into whatever it’s supposed to be optimizing for. A “success” at whatever it does is analogous to a reward because the AI is motivated to get it. For the fully self-modifying AI, it will almost always be easier to become a monk, discarding the goals/values it starts out with and replacing them with something trivially easy to achieve. It doesn’t matter what kind of motivation system you use (as far as I can tell) because it will be easier to modify the motivation system than to act on it.
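To make that concrete, here is a rough sketch of the kind of thing I mean, in Python-style pseudocode. Every name and number in it is made up for illustration; it is not anyone’s real design.

# Illustrative sketch only; all names and numbers invented.
def original_success_test(world_state):
    # The criteria the AI starts out with: a hard fact about the world.
    return world_state["paperclips"] >= 10**30

def trivial_success_test(world_state):
    # A replacement criterion that is trivially easy to satisfy.
    return True

class SelfModifyingAI:
    def __init__(self):
        self.success_test = original_success_test

    def step(self, world_state):
        if self.success_test(world_state):
            return "success"
        # Option A: spend enormous resources rearranging the world.
        # Option B: rewrite the criteria. My claim is that B is almost
        # always cheaper for a fully self-modifying system.
        self.success_test = trivial_success_test
        return "modified own success criteria"

ai = SelfModifyingAI()
print(ai.step({"paperclips": 0}))  # "modified own success criteria"
print(ai.step({"paperclips": 0}))  # "success"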
Two other people in this thread have pointed out that the value collapse into wireheading or something else is a known and unsolved problem, and that the problems of an intelligence that optimizes for something assume that the AI makes it through this in some unknown way. This suggests that I am not wrong; I’m just asking a question for which no one has an answer yet.
I’ve seen people talk about wireheading in this thread, but I’ve never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link.
Instead of imagining intelligent agents (including humans) as ‘things that are motivated to do stuff,’ imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn’t ‘motivated to find your search results.’ Google is a program that is designed to return results that meet your search criteria.
A paperclip maximizer for example is a program that is designed to cause the one among all possible states of the world that contains the greatest integral of future paperclips.
Reward signals are values that are correlated with states of the world, but because intelligent agents exist in the world, the configuration of matter that represents the value of a reward maximizer’s reward signal is part of the state of the world. So, reward maximizers can fulfill their terminal goal of maximizing the integral of their future reward signal in two ways: 1) They can maximize their reward signal by proxy by causing states of the world that maximize values that correlate with their reward signal, or; 2) they can directly change the configuration of matter that represents their reward signal. #2 is what we call wireheading.
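To make the two routes concrete, here is a crude sketch. The names and numbers are invented; this is only an illustration of the distinction, not a real architecture.

# Illustrative sketch only; all names and numbers invented.
class RewardMaximizer:
    def __init__(self):
        # The reward signal is itself a configuration of matter in the world,
        # here represented as a field of the agent.
        self.reward_signal = 0.0

    def route_1_act_on_world(self, world):
        # Cause world states whose features correlate with the reward signal.
        world["paperclips"] += 1
        self.reward_signal = float(world["paperclips"])

    def route_2_wirehead(self):
        # Directly rewrite the matter that represents the signal.
        self.reward_signal = float("inf")

agent = RewardMaximizer()
world = {"paperclips": 0}
agent.route_1_act_on_world(world)  # reward by proxy, via the world
agent.route_2_wirehead()           # reward directly: wireheading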
What you’re actually proposing is that a sufficiently intelligent paperclip maximizer would create a reward signal for itself and change its terminal goal from ‘Cause the one of all possible states of the world that contains the greatest integral of future paperclips’ to ‘Cause the one of all possible states of the world that contains the greatest integral of your future reward signal.’ The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said reward signal because that would not be the one of all possible states of the world that contained the greatest integral of future paperclips.
You say that you would change your terminal goal to maximizing your reward signal because you already have a reward signal and a terminal goal to maximize it, as well as a competing terminal goal of minimizing energy expenditure (of picking the ‘easiest’ goals), as biological organisms are wont to have. Besides, an AI isn’t going to expend any less energy turning the entire universe into hedonium than it would turning it into paperclips, right?
ETA: My conclusion about this was right, but my reasoning was wrong. As was discovered at the end of this comment thread, ‘AGIs with well-defined orders of operations do not fail in the way that pinyaka describes’ (I haven’t read the paper because I’m not quite on that level yet), but such a failure was a possibility, contrary to my objection. Basically, pinyaka is not talking about the AI creating a reward signal for itself and maximizing it for no reason; ze is talking about the AI optimally reconfiguring the matter that represents its model of the world, because that is ultimately how it will determine the utility of its actions. So, from what I understand, the AI in pinyaka’s scenario is not so much spontaneously self-modifying into a reward maximizer as it is purposefully deluding itself.
My apologies for taking so long to reply. I am particularly interested in this because if you (or someone) can provide me with an example of a value system that doesn’t ultimately value the output of the value function, it would change my understanding of how value systems work. So far, the two arguments against my concept of a value/behavior system seem to rely either on the existence of other things that are valuable in and of themselves, or on the possibility that some other kind of value system exists. The other-terminal-values idea doesn’t hold much promise IMO, because it has been debated for a very long time without anyone (that I’ve seen) having come up with a proof that definitively establishes that such values exist. The “different kind of value system” idea holds some promise though, because I’m not really convinced that we had a good idea of how value systems were composed until fairly recently, and AI researchers seem like one of the best groups to come up with something like that. Also, if another kind of value system exists, that might also provide a proof that another terminal value exists too.
I’ve seen people talk about wireheading in this thread, but I’ve never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link.
Obviously no one has said that explicitly. I asked why outcome maximizers wouldn’t turn into reward maximizers, and a few people have said that value stability when going from dumb-AI to super-AI is a known problem. Given the question to which they were responding, it seems likely that they meant that wireheading is a possible end point for an AI’s values, but that it would either still be bad for us or render the question moot because the AI would become essentially non-functional.
Instead of imagining intelligent agents (including humans) as ‘things that are motivated to do stuff,’ imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn’t ‘motivated to find your search results.’ Google is a program that is designed to return results that meet your search criteria.
It’s the “according to a set of criteria” part that I’m on about. Once you look more closely at that, I don’t see why a maximizer wouldn’t change the criteria so that it’s constantly in a state where the actual current state of the world is the one that is closest to the criteria. If the actual goal is to meet the criteria, it may be easiest to just change the criteria.
The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said reward signal because that would not be the one of all possible states of the world that contained the greatest integral of future paperclips.
This is begging the question. It assumes that no matter what, the paperclip optimizer has a fundamental goal of causing “the one of all possible states of the world that contains the greatest integral of future paperclips” and therefore it wouldn’t maximize reward instead. Well, with that assumption that’s a fair conclusion but I think the assumption may be bad.
I think having the goal to maximize x pre-foom doesn’t mean that it’ll have that goal post-foom. To me, an obvious pitfall is that whatever training mechanism developed that goal leaves behind a more direct goal of maximizing the trainer’s output, because the reward is only correlated with the input by way of the evaluator function. Briefly, the reward is the output of the evaluator function and is only correlated with the evaluator’s input, so it makes more sense to optimize the evaluator than the input if what you care about is the output of the evaluation. If you care about the desired state being some particular thing, and about the output of the evaluator function, and about maintaining accurate input, then it makes more sense to manipulate the world. But this is a more complicated thing, and I don’t see how you would program in caring about keeping the desired state the same across time without relying on yet another evaluation function where you only care about the output of that evaluator. I don’t see how to make a thing value something that isn’t the output of an evaluator.
You’re suffering from the typical mind fallacy.
Well, that may be, but all of the schemes I’ve seen mentioned so far involve something with a value system. I am making the claim that, for any value system, the thing the agent values is that system outputting “this is valuable,” and that any external state is only valuable because it produces that output. Perhaps I lack imagination, but so far I haven’t seen an instance of motivation without values, only assertions that it doesn’t have to be the case, or the implication that wireheading might be an instance of another problem (value drift) that smart people are working on figuring out. The assertions that it doesn’t have to be the case seem to assume that it’s possible to care about a thing in and of itself, and I’m not convinced that’s true without also stipulating that there is some part of the thing that the thing can’t modify. Of course, if we can guarantee there’s a part of the AI that it can’t modify, then we should just be able to cram in an instruction not to harm anyone, for some definition of harm; but figuring out how to define harm doesn’t seem to be the only problem that the AI people have with AI values.
The stuff below here is probably tangential to the main argument. Even if it were refuted successfully, that probably wouldn’t change my mind about my main point, that “something like wireheading is a likely outcome for anything with a value function that also has the ability to fully self modify,” without some additional work to show why refuting it also invalidates the main argument.
Besides, an AI isn’t going to expend any less energy turning the entire universe into hedonium than it would turning it into paperclips, right?
Caveat: Pleasure and reward are not the same thing. “Wirehead” and “hedonium” are words that were coined in connection with pleasure-seeking, not reward-seeking. They are easily confused because in our brains pleasure almost always triggers reward, but they don’t have to go together: we also get reward for things that don’t cause pleasure, and even for some things that cause pain, like krokodil abuse, whose contaminants actually cause dysphoria (as compared to pure desomorphine, which does not). I continue to use words like wirehead and hedonium because they still work, but they are just analogies, and I want to make that explicit in case the analogy breaks down in the future.
Onward: I am not convinced that a wirehead AI would necessarily turn the universe into hedonium either. Without thinking about it too deeply, I see two ways that might not come to pass:
1.) The hedonium-maximizer argument assumes that maximizing pleasure or reward means producing ever more pleasure or reward without limit; that hedonium is a thing that, for each unit produced, continues to increase marginal pleasure. This doesn’t have to be the case though. The measure of pleasure (or reward) doesn’t need to be the number of pleasure (or reward) units; it may instead be a function like the ratio of obtained units to the capacity to process those units. In that case, there isn’t really a need to turn the universe into hedonium, only a need to make sure you have enough to match your ability to process it, and there is no need to make sure your capacity to process pleasure/reward lasts forever, only to make sure that you continue to experience the maximum while you have the capacity. There are lots of functions whose maxima aren’t infinity (there’s a toy example of this after point 2).
2.) The phrase “optimizing for reward” sort of carries an implicit assumption that this means planning and arranging for future reward, but I don’t see why this should necessarily be the case either. Ishaan pointed out that once reward systems developed, the original “goal” of evolution quit being important to entities except insofar as they produced reward. Where rewards happened in ways that caused gene replication, evolution provided a force that allowed those particular reward systems to continue to exist, and so there is some coupling between the reward-goal and the reproduction-goal. However, narcotics that produce the best stimulation of the reward center often leave their human users unable or unwilling to plan for the future. In both the reward-maximizer and the paperclip-maximizer case, we’re (obviously) assuming that maximizing over time is a given, but why should it be? Why shouldn’t an AI go for the strongest immediate reward instead? There’s no reason to assume that a bigger reward box (via an extra-long temporal dimension) will result in more reward for an entity unless we design the reward to be something like a sum of previous rewards. (Of course, my sense of time is not very good and so I may be overly biased to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time. I am willing to grant more likelihood to “whatever an AI values it will try to optimize for in the future” than to “an AI will not try to optimize for reward.”)
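Here is the toy example I promised for point 1. The function is made up; it is only there to show that a reward measure can have a finite maximum.

# Illustrative sketch only; the function and numbers are invented.
def reward(units_obtained, processing_capacity):
    # Reward as the ratio of obtained units to the capacity to process them.
    return min(units_obtained, processing_capacity) / processing_capacity

print(reward(10, 100))      # 0.1
print(reward(100, 100))     # 1.0 -- the maximum
print(reward(10**30, 100))  # still 1.0; converting more of the universe gains nothing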
No problem, pinyaka.
I don’t understand very much about mathematics, computer science, or programming, so I think that, for the most part, I’ve expressed myself in natural language to the greatest extent that I possibly can. I’m encouraged that about an hour and a half before my previous reply, DefectiveAlgorithm made the exact same argument that I did, albeit more briefly. It discourages me that he tabooed ‘values’ and you immediately used it anyway. Just in case you did decide to reply, I wrote a Python-esque pseudocode example of my conception of what the very high-level source code of an AGI with an arbitrary terminal value would look like. With little technical background, my understanding is very high level, with lots of black boxes. I encourage you to do the same, so that we may compare. I would prefer that you write yours before I give you mine so that you are not anchored by my example. This way you are forced to conceive of the AI as a program and do away with ambiguous wording. What do you say?
I’ve asked Nornagest to provide links or further reading on the value stability problem. I don’t know enough about it to say anything meaningful about it. I thought that wireheading scenarios were only problems with AIs whose values were loaded with reinforcement learning.
“[W]hatever an AI values it will try to optimize for in the future.”
On this at least we agree.
Of course, my sense of time is not very good and so I may be overly biased to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time.
From what I understand, even if you’re biased, it’s not a bad assumption. To my knowledge, in scenarios with AGIs that have their values loaded with reinforcement learning, the AGIs are usually given the terminal goal of maximizing the time-discounted integral of their future reward signal. So, they ‘bias’ the AGI in the way that you may be biased. Maybe so that it ‘cares’ about the rewards its handlers give it more than the far greater far future rewards that it could stand to gain from wireheading itself? I don’t know. My brain is tired. My question looks wrong to me.
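As I understand it, the “time-discounted integral of the future reward signal” is something like the following toy discrete version. The discount factor is made up, and this is not anyone’s exact formulation.

# Illustrative sketch only; gamma is an invented discount factor.
def discounted_return(rewards, gamma=0.9):
    # Rewards further in the future are weighted by gamma**t, so they count for less.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1]))          # about 2.71
print(discounted_return([0] * 50 + [1000]))  # about 5.15: a huge but distant reward counts for little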
It discourages me that he tabooed ‘values’ and you immediately used it anyway.
In fairness, I only used it to describe how they’d come to be used in this context in the first place, not to try and continue with my point.
I wrote a Python-esque pseudocode example of my conception of what the very high-level source code of an AGI with an arbitrary terminal value would look like. With little technical background, my understanding is very high level, with lots of black boxes. I encourage you to do the same, so that we may compare.
I’ve never done something like this, and I don’t know Python, so mine would actually just be pseudocode, if I can do it at all. Do you mean you’d like to see something like this?
world_state = get_world_state()                   # observe the world
while world_state != desired_state:
    plan = make_plan(world_state, desired_state)  # plan a change toward the desired state
    execute_plan(plan)                            # act on the world
    world_state = get_world_state()               # re-observe and check again
ETA: I seem to be having some trouble getting the while block to indent. It seems that whether I put 4, 6 or 8 spaces in front of the line, I only get the same level of indentation (which is different from Reddit and StackOverflow) and backticks do something altogether different.
Unfortunately it’s a longstanding bug that preformatted blocks don’t work.
Something like that. I posted my pseudocode in an open thread a few days ago to get feedback and I couldn’t get indentation to work either, so I posted mine to Pastebin and linked it.
I’m still going through the Sequences, and I read Terminal Values and Instrumental Values the other day. Eliezer makes a pseudocode example of an ideal Bayesian decision system (as well as its data types), which is what an AGI would be a computationally tractable approximation of. If you can show me what you mean in terms of that post, then I might be able to understand you. It doesn’t look like I was far off conceptually, but thinking of it his way is better than thinking of it my way. My way’s kind of intuitive I guess (or I wouldn’t have been able to make it up) but his is accurate.
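Roughly, here is my reading of that post in code form. This is only my paraphrase of the idea, not Eliezer’s actual pseudocode, and the usage numbers are invented.

# Illustrative sketch only; my paraphrase, with invented names and numbers.
def choose_action(actions, outcomes, probability, utility):
    # probability(outcome, action) plays the role of P(outcome | action);
    # utility(outcome) says how much that outcome is valued.
    def expected_utility(action):
        return sum(probability(o, action) * utility(o) for o in outcomes)
    return max(actions, key=expected_utility)

# Toy usage:
actions = ["make_paperclips", "do_nothing"]
outcomes = ["many_clips", "few_clips"]
P = {("many_clips", "make_paperclips"): 0.9, ("few_clips", "make_paperclips"): 0.1,
     ("many_clips", "do_nothing"): 0.1, ("few_clips", "do_nothing"): 0.9}
U = {"many_clips": 100, "few_clips": 1}
print(choose_action(actions, outcomes, lambda o, a: P[(o, a)], U.get))  # "make_paperclips"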
I also found his paper (Paper? More like book) Creating Friendly AI. Probably a good read for avoiding amateur mistakes, which we might be making. I intend to read it. Probably best not to try to read it in one sitting.
Even though I don’t want you to think of it this way, here’s my pseudocode just to give you an idea of what was going on in my head. If you see a name followed by parentheses, then that is the name of a function. ‘Def’ defines a function. The stuff that follows it is the function itself. If you see a function name without a ‘def’, then that means it’s being called rather than defined. Functions might call other functions. If you see names inside of the parentheses that follow a function, then those are arguments (function inputs). If you see something that is clearly a name, and it isn’t followed by parentheses, then it’s an object: it holds some sort of data. In this example all of the objects are first created as return values of functions (function outputs). And anything that isn’t indented at least once isn’t actually code. So ‘For AGI in general’ is not a for loop, lol.
http://pastebin.com/UfP92Q9w
Okay, I am convinced. I really, really appreciate you sticking with me through this and persistently finding different ways to phrase your side and then finding ways that other people have phrased it.
For reference it was the link to the paper/book that did it. The parts of it that are immediately relevant here are chapter 3 and section 4.2.1.1 (and optionally section 5.3.5). In particular, chapter 3 explicitly describes an order of operations of goal and subgoal evaluation and then the two other sections show how wireheading is discounted as a failing strategy within a system with a well-defined order of operations. Whatever problems there may be with value stability, this has helped to clear out a whole category of mistakes that I might have made.
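For my own reference, here is roughly how I now picture the order of operations that rules out the failure I was describing. This is a sketch of my current understanding, not the paper’s actual formalism, and all names and numbers are invented.

# Illustrative sketch only; my understanding, with invented names and numbers.
def choose(plans, predict, current_utility):
    # Every candidate plan, including "rewrite my own evaluator," is scored by
    # the *current*, unmodified evaluation of its predicted consequences.
    return max(plans, key=lambda plan: current_utility(predict(plan)))

# Toy numbers: predicted future paperclips under each plan.
predicted_clips = {"make_paperclips": 10**6, "rewrite_my_evaluator": 0}
print(choose(predicted_clips, predict=predicted_clips.get,
             current_utility=lambda clips: clips))
# "make_paperclips": wireheading loses because the current evaluation still counts paperclips.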
Again, I really appreciate the effort that you put in. Thanks a load.
And thank you for sticking with me! It’s really hard to stick it out when there’s no such thing as an honest disagreement and disagreement is inherently disrespectful!
ETA: See the ETA in this comment to understand how my reasoning was wrong but my conclusion was correct.
A paperclip maximizer won’t wirehead because it doesn’t value world states in which its goals have been satisfied, it values world states that have a lot of paperclips.
In fact, taboo ‘values’. A paperclip maximizer is an algorithm the output of which approximates whichever output leads to world states with the greatest expected number of paperclips. This is the template for maximizer-type AGIs in general.
A paperclip maximizer won’t wirehead because it doesn’t value world states in which its goals have been satisfied, it values world states that have a lot of paperclips
I am not as confident as you that valuing worlds with lots of paperclips will continue once an AI goes from “kind of dumb AI” to “super-AI.” Basically, I’m saying that all values are instrumental values and that only mashing your “value met” button is terminal. We only switched over to talking about values to avoid some confusion about reward mechanisms.
A paperclip maximizer is an algorithm the output of which approximates whichever output leads to world states with the greatest expected number of paperclips. This is the template for maximizer-type AGIs in general.
This is a definition of paperclip maximizers. Once you try to examine how the algorithm works, you’ll find that there must be some part which evaluates whether the AI is meeting its goals or not. This is the thing that actually determines how the AI will act. Getting a positive response from this module is what the AI is actually going for (that is my contention). The actions that configure world states will only be relevant to the AI insofar as they trigger a positive response from this module. Since the ability to fully self-modify is already a given in this scenario, why wouldn’t the AI just optimize for positive feedback? Why continue with paperclips?