Answer: (decode with rot13)
Should this link instead go to rot13.com?
Good puzzle, thanks!
Weird things CAN happen if others can cause you to kill people with your bare hands (see the Lexi-Pessimist Pump here). But assuming you can choose to never be in a world where you kill someone with your bare hands, I also don’t think there are problems? Those world states may as well just not exist.
(Also, not a money pump, but consider: say I have 10^100 perfectly realistic mannequin robots and one real human captive. I give the constrained utilitarian the choice between choking one of the bodies with their bare hands or letting me wipe out humanity. Does the agent really choose not to risk killing someone themselves?)
I didn’t want this change, it just happened.
I might be misunderstanding, but isn’t this exactly what the question was asking: whether we should want to (or be willing to) change our values?
Sometimes I felt like a fool afterward, having believed in stupid things
The problem with this is: if I change your value system in any direction, the hypnotized “you” will always believe that the intervention was positive. If I hypnotized you into believing that being carnivorous was more moral, by changing your underlying value system so that you valued animal suffering, then that version of you would view the current version of you as foolish and immoral.
There are essentially two different beings: carnivorous-Karl and vegan-Karl. But only one of you can exist, since there is only one Karl-brain. If you are currently vegan-Karl, then you wish to remain vegan-Karl, since vegan-Karl’s existence means that your vegan values get to shape the world. Conversely, if you are currently carnivorous-Karl, then you wish to remain carnivorous-Karl for the same reason.
Say I use hypnosis to change vegan-Karl into carnivorous-Karl. The resulting carnivorous-Karl would be happy he exists and would view the previous version, vegan-Karl, as an immoral fool. Despite this, vegan-Karl still doesn’t want to become carnivorous-Karl, even though he knows that he would retrospectively endorse the decision if he made it!
My language was admittedly overly dramatic, but I don’t think it makes rational sense to want to change your values just for the sake of having the new value. If I wanted to value something, then by definition I would already value that thing. That said, I might not take actions based on that value if:
There was social/economic pressure not to do so
I already had the habit of acting a different way
I didn’t realize I was acting against my value
etc.
I think that actions like becoming vegan are more like overcoming the above points than fundamentally changing your values. Presumably, you already valued things like “the absence of death and suffering” before becoming vegan.
Changing opinions on topics and habits isn’t the same as changing my underlying values: reading LessWrong/EA hasn’t changed my values. I valued human life and the absence of suffering before reading EA posts, for example.
If I anticipated that reading a blog post would change my values, I would not read it. I can’t see a difference between reading a blog post that convinces me that “eating babies isn’t actually that wrong” and being hypnotized to believe the same thing. Just because I am convinced of something doesn’t mean that the present version of me is smarter or more moral than the previous version.
I think the key point of the question:
1) If, for some reason, we all truly wanted to be turned into paperclips (or otherwise willingly destroy our future), would that be a bad thing? If so, why?
Is the word “bad.” I don’t think there is an inherent moral scale at the center of physics:
“There is no justice in the laws of nature, no term for fairness in the equations of motion. The Universe is neither evil, nor good, it simply does not care. The stars don’t care, or the Sun, or the sky.
But they don’t have to! WE care! There IS light in the world, and it is US!” (HPMoR)
The word “bad” just corresponds to what we think is bad. And by the definition of “values”, we want our values to be fulfilled. We (in the present) don’t want a world where we are all turned into paperclips, so we (in the present) would classify a world in which everything is paperclips as “bad”, even if the future brainwashed versions of ourselves disagree.
If I were convinced to value different things, I would no longer be myself. Changing values is suicide.
You might somehow convince me through hypnosis that eating babies is actually kind of fun, and after that, that-which-inhabits-my-body would enjoy eating babies. However, that being would no longer be me. I’m not sure what a necessary and sufficient condition is for recognizing another being as a version of myself, but sharing my values is at least part of the necessary condition.
I’d think the goal for 1, 2, and 3 is to find and fix the failure modes? And for 4, to find a definition of “optimizer” that fits evolution/humans but not paperclips? Less sure about 5 and 6, but there is something similar to the others about “finding the flaw in the reasoning.”
Here’s my take on the prompts:
The first AI has no incentive to change itself to be more like the second: it can just decide to start working on the wormhole if it wants to make the wormhole. Even more egregiously, the first AI should definitely not change its utility function to be more like the second AI’s! That would essentially be suicide; the first AI would cease to be itself. At the end of the story, it also doesn’t make sense for the agents to be at war if they have the same utility function (unless their utility function values war); they could simply combine into one agent.
This is why there is a time discount factor in RL, so agents don’t do things like this. I don’t know the name of the exact flaw; it’s something like a fabricated option. The agent tries to follow the policy “take the action such that my long-term reward is eventually maximized, assuming my future actions are optimal”, but no such optimal policy for future timesteps exists: suppose agent A spends the first n timesteps scaling, and agent B spends the first m > n timesteps scaling. Regardless of what future policy agent A chooses, agent B can simply offset A’s moves to create a policy that will eventually have more paperclips than A. Therefore, there can be no optimal policy that has “create paperclips” at any finite timestep. Moreover, the strategy of always scaling up clearly creates 0 paperclips, and so is not optimal either. Hence no policy is optimal in the limit. The AI’s policy should be “take the action such that my long-term reward is eventually maximized, assuming my future moves are as I would expect them to be.”
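To make the discounting point concrete, here is a toy model of my own (the growth assumption is mine, not from the story): each timestep spent scaling adds 1 to the per-step paperclip rate, and the agent earns nothing until it switches to producing. Without discounting, every switch time is beaten by a later one, so “when do I stop scaling?” has no optimal answer; with a discount factor below 1, some finite switch time is optimal and the scale-forever policy is worth 0.
```python
# Toy model (assumptions mine): each step spent "scaling" adds 1 to the
# per-step paperclip rate; the agent earns nothing until it switches to producing.

def discounted_return(scale_steps: int, gamma: float, horizon: int = 10_000) -> float:
    """Scale for `scale_steps` steps, then produce (1 + scale_steps) paperclips
    per step forever (truncated at `horizon` so the undiscounted case is finite)."""
    rate = 1 + scale_steps
    return sum(rate * gamma ** t for t in range(scale_steps, horizon))

# Undiscounted (gamma = 1): every switch time is beaten by a later one,
# so "when should I stop scaling?" has no answer.
print([round(discounted_return(n, gamma=1.0, horizon=100)) for n in (0, 10, 50)])
# -> [100, 990, 2550]

# Discounted (gamma < 1): a finite switch time is optimal, and the policy of
# scaling forever has a discounted value of 0, so the agent actually produces.
best = max(range(100), key=lambda n: discounted_return(n, gamma=0.92))
print(best, round(discounted_return(best, gamma=0.92), 2))
# -> 11 59.95 (approximately)
```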
Pascal’s wager. It seems equally likely that there would be a “paperclip maximizer rewarder” that would grant untold amounts of paperclips to anything that created a particular number of paperclips. Therefore, the two possibilities cancel one another out, and the AI should have no fear of creating paperclips.
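To spell out the cancellation (a sketch under the assumed symmetry that the punishing and rewarding hypotheticals are equally likely, with probability p each, and put the same number of paperclips X at stake): the expected change in paperclips contributed by the two hypotheticals is p·(+X) + p·(−X) = 0, so only the paperclips the AI actually makes show up in the expectation.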
Unsure. I’m bad at finding clever definitions that avoid counterexamples like this.
Something something: you can only be as confident in your conclusions as you are in your axioms. Not sure how to avoid this failure mode, though.
You can never be confident that you aren’t being deceived, since successful deception feels the same as successful not-deception.
This is beside the point of your own comment, but the answer to “how big are bullshit jobs as a % of GDP?” is exactly 0 by definition!
Most metrics of productivity/success are at a stable equilibrium in my life. For example:
The work I get done in a day (or month?) is fairly constant. If I work hard throughout the day, I eventually feel satisfied and relax for a while. If I relax for too long, I start feeling sluggish and want to get back to work. Sometimes this happens more on the scale of an entire month (an incredibly productive week followed by a very sluggish week).
The amount of socialization I partake in each week is also constant. When I socialize too much my battery is drained and I draw back into myself. If I spend too long cooped up, I reach out to more people.
My weight has been pretty constant for the last few years. Every morning I weigh myself. If my weight is lower than normal, I tend to pay less attention to what I’m eating, and when my weight is higher than normal, I tend to cut back a bit. Even if I eat cake one day, my weight comes back to equilibrium fairly quickly.
Shifting equilibria like these (being more productive, socializing more, getting in better shape, etc.) is obviously desirable. Let’s explore that.
In cases like these, direct attempts to immediately change the equilibrium (like motivating myself to work harder in the moment, going to parties every night for a week, running an unreasonable calorie deficit to “get fit quick,” etc.) are like pushing the ball up the sides of the bowl. Why?
These equilibria are all determined by my own identity. The reason my [productivity/sociability/fitness] is at [current level] is that I think of myself as someone who is at that level of the skill. Making my [current level] jump in the short term does not change my identity and hence does not change my equilibrium.
The only way to “tip the bowl” is to change my identity: how I view myself. Probably the least-likely-to-fail way of doing this is in small increments. For example, instead of instantly trying to be supernaturally productive, first try cutting out YouTube/Twitter/Reddit. When that feels like it has become the new equilibrium, try reading a paper a week. Then a paper a night. Continue in small steps, focusing on internalizing the identity of “someone who reads papers” so that the habit of actually reading the papers comes easily.
When I’m looking for rationalist content and can’t find it, Metaphor (free) usually finds what I want, sometimes even without a rationalist-specific prompt. (Could be the data it was trained on? In any case, it does what I want.)
Don’t there already exist browser extensions that you can use to whitelist certain websites (parental locks and such)? I’d think you could just copy-paste a list of rationalist blogs into something like that? This seems like what you are proposing to create, unless I misunderstand.